rippled/OpenTelemetryPlan/09-data-collection-reference.md
Pratik Mankawde 4db67bc191 Phase 10: Synthetic workload generation and telemetry validation tools
Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:

Task 10.1 — Multi-node test harness:
  - docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
    Tempo, Prometheus, Loki, Grafana)
  - generate-validator-keys.sh for automated key generation
  - xrpld-validator.cfg.template for node configuration

Task 10.2 — RPC load generator:
  - rpc_load_generator.py with WebSocket client, configurable rates,
    realistic command distribution (40% health, 30% wallet, 15% explorer,
    10% tx lookups, 5% DEX), W3C traceparent injection

Task 10.3 — Transaction submitter:
  - tx_submitter.py with 10 transaction types (Payment, OfferCreate,
    OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
    EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts

Task 10.4 — Telemetry validation suite:
  - validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
    log-trace correlation (Loki), dashboards (Grafana)
  - expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
  - expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)

Task 10.5 — Performance benchmark suite:
  - benchmark.sh for baseline vs telemetry comparison
  - collect_system_metrics.sh for CPU/memory/latency sampling
  - Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus

Task 10.6 — CI integration:
  - telemetry-validation.yml GitHub Actions workflow
  - run-full-validation.sh orchestrator script
  - Manual trigger + telemetry branch auto-trigger

Task 10.7 — Documentation:
  - workload/README.md with quick start and tool reference
  - Updated telemetry-runbook.md with validation and benchmark sections
  - Updated 09-data-collection-reference.md with validation inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00


Observability Data Collection Reference

Audience: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.

Related docs: docs/telemetry-runbook.md (operator runbook with alerting and troubleshooting) | 03-implementation-strategy.md (code structure and performance optimization) | 04-code-samples.md (C++ instrumentation examples)

Data Flow Overview

graph LR
    subgraph rippledNode["rippled Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
        B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
        C["MetricsRegistry<br/>OTel SDK metrics<br/>(OTLP/HTTP exporter)"]
    end

    subgraph collector["OTel Collector  :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC  |  :4318 HTTP<br/>(traces + metrics)"]
        BP["Batch Processor<br/>timeout 1s, batch 100"]
        SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]

        R1 --> BP
        BP --> SM
    end

    subgraph backends["Trace Backends  (choose one or both)"]
        D["Jaeger  :16686<br/>Trace search &<br/>visualization"]
        T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
    end

    subgraph metrics["Metrics Stack"]
        E["Prometheus  :9090<br/>scrapes :8889<br/>span-derived + system metrics"]
    end

    subgraph viz["Visualization"]
        F["Grafana  :3000<br/>13 dashboards"]
    end

    A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
    B -->|"OTLP/HTTP :4318<br/>(gauges, counters, histograms)"| R1
    C -->|"OTLP/HTTP :4318<br/>(counters, histograms,<br/>observable gauges)"| R1

    BP -->|"OTLP/gRPC :4317"| D
    BP -->|"OTLP/gRPC"| T

    SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
    R1 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* histograms"| E

    E -->|"Prometheus<br/>data source"| F
    D -->|"Jaeger<br/>data source"| F
    T -->|"Tempo<br/>data source"| F

    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
    style B fill:#4a90d9,color:#fff,stroke:#2a6db5
    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
    style T fill:#e8953a,color:#000,stroke:#b5732a
    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de

There are two independent telemetry pipelines entering a single OTel Collector via the same OTLP receiver:

  1. OpenTelemetry Traces — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's OTLP Receiver. The Batch Processor groups spans (1s timeout, batch size 100) before forwarding to trace backends. The SpanMetrics Connector derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
  2. beast::insight OTel Metrics — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same OTLP Receiver. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; server=statsd remains available as a fallback.

Trace backends — The collector exports traces via OTLP/gRPC to one or both:

  • Jaeger (development) — Provides trace search UI at :16686. Easy single-binary setup.
  • Grafana Tempo (production) — Preferred for production. Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.

Further reading: 00-tracing-fundamentals.md for core OpenTelemetry concepts (traces, spans, context propagation, sampling). 07-observability-backends.md for production backend selection, collector placement, and sampling strategies.


1. OpenTelemetry Spans

1.1 Complete Span Inventory (17 spans)

See also: 02-design-decisions.md §2.3 for naming conventions and the full span catalog with rationale. 04-code-samples.md §4.6 for span flow diagrams.

RPC Spans

Controlled by trace_rpc=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| rpc.request | (none) | ServerHandler.cpp | Top-level HTTP RPC request entry point |
| rpc.process | rpc.request | ServerHandler.cpp | RPC processing pipeline |
| rpc.ws_message | (none) | ServerHandler.cpp | WebSocket message handling |
| rpc.command.<name> | rpc.process | RPCHandler.cpp | Per-command span (e.g., rpc.command.server_info, rpc.command.ledger) |

Where to find: Jaeger → Service: rippled → Operation: rpc.request or rpc.command.*

Grafana dashboard: RPC Performance (rippled-rpc-perf)

Transaction Spans

Controlled by trace_transactions=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| tx.process | (none) | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| tx.receive | (none) | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| tx.apply | ledger.build | BuildLedger.cpp | Transaction set applied to new ledger during consensus |

Where to find: Jaeger → Operation: tx.process or tx.receive

Grafana dashboard: Transaction Overview (rippled-transactions)

Consensus Spans

Controlled by trace_consensus=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| consensus.proposal.send | (none) | RCLConsensus.cpp | Node broadcasts its transaction set proposal |
| consensus.ledger_close | (none) | RCLConsensus.cpp | Ledger close event triggered by consensus |
| consensus.accept | (none) | RCLConsensus.cpp | Consensus accepts a ledger (round complete) |
| consensus.validation.send | (none) | RCLConsensus.cpp | Validation message sent after ledger accepted |
| consensus.accept.apply | (none) | RCLConsensus.cpp | Ledger application with close time details |

Where to find: Jaeger → Operation: consensus.*

Grafana dashboard: Consensus Health (rippled-consensus)

Ledger Spans

Controlled by trace_ledger=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| ledger.build | (none) | BuildLedger.cpp | Build new ledger from accepted transaction set |
| ledger.validate | (none) | LedgerMaster.cpp | Ledger promoted to validated status |
| ledger.store | (none) | LedgerMaster.cpp | Ledger stored to database/history |

Where to find: Jaeger → Operation: ledger.*

Grafana dashboard: Ledger Operations (rippled-ledger-ops)

Peer Spans

Controlled by trace_peer=1 in [telemetry] config. Disabled by default (high volume).

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| peer.proposal.receive | (none) | PeerImp.cpp | Consensus proposal received from peer |
| peer.validation.receive | (none) | PeerImp.cpp | Validation message received from peer |

Where to find: Jaeger → Operation: peer.*

Grafana dashboard: Peer Network (rippled-peer-net)


1.2 Complete Attribute Inventory (28 attributes)

See also: 02-design-decisions.md §2.4.2 for attribute design rationale and privacy considerations.

Every span can carry key-value attributes that provide context for filtering and aggregation.

RPC Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.rpc.command | string | rpc.command.* | RPC command name (e.g., server_info, ledger) |
| xrpl.rpc.version | int64 | rpc.command.* | API version number |
| xrpl.rpc.role | string | rpc.command.* | Caller role: "admin" or "user" |
| xrpl.rpc.status | string | rpc.command.* | Result: "success" or "error" |
| xrpl.rpc.duration_ms | int64 | rpc.command.* | Command execution time in milliseconds |
| xrpl.rpc.error_message | string | rpc.command.* | Error details (only set on failure) |

Jaeger query: Tag xrpl.rpc.command=server_info to find all server_info calls.

Prometheus label: xrpl_rpc_command (dots converted to underscores by SpanMetrics).
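The dots-to-underscores conversion follows the Prometheus label-name rules. A minimal sketch of that rule (illustrative only, not the collector's actual implementation):

```python
import re

def to_prometheus_label(attr_name: str) -> str:
    """Map an OTel attribute name to a Prometheus-safe label name.

    Prometheus label names must match [a-zA-Z_][a-zA-Z0-9_]*, so dots
    (and any other disallowed character) become underscores.
    """
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attr_name)
    if not re.match(r"[a-zA-Z_]", label):
        label = "_" + label  # label must not start with a digit
    return label
```

For example, `xrpl.rpc.command` becomes `xrpl_rpc_command`, matching the labels shown throughout this document.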

Transaction Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.tx.hash | string | tx.process, tx.receive | Transaction hash (hex-encoded) |
| xrpl.tx.local | boolean | tx.process | true if locally submitted, false if peer-relayed |
| xrpl.tx.path | string | tx.process | Submission path: "sync" or "async" |
| xrpl.tx.suppressed | boolean | tx.receive | true if transaction was suppressed (duplicate) |
| xrpl.tx.status | string | tx.receive | Transaction status (e.g., "known_bad") |

Jaeger query: Tag xrpl.tx.hash=<hash> to trace a specific transaction across nodes.

Prometheus label: xrpl_tx_local (used as SpanMetrics dimension).

Consensus Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.consensus.round | int64 | consensus.proposal.send | Consensus round number |
| xrpl.consensus.mode | string | consensus.proposal.send, consensus.ledger_close | Node mode: "syncing", "tracking", "full", "proposing" |
| xrpl.consensus.proposers | int64 | consensus.proposal.send, consensus.accept | Number of proposers in the round |
| xrpl.consensus.proposing | boolean | consensus.validation.send | Whether this node was a proposer |
| xrpl.consensus.ledger.seq | int64 | consensus.ledger_close, consensus.accept, consensus.validation.send, consensus.accept.apply | Ledger sequence number |
| xrpl.consensus.close_time | int64 | consensus.accept.apply | Agreed-upon ledger close time (epoch seconds) |
| xrpl.consensus.close_time_correct | boolean | consensus.accept.apply | Whether validators reached agreement on close time |
| xrpl.consensus.close_resolution_ms | int64 | consensus.accept.apply | Close time rounding granularity in milliseconds |
| xrpl.consensus.state | string | consensus.accept.apply | Consensus outcome: "finished" or "moved_on" |
| xrpl.consensus.round_time_ms | int64 | consensus.accept.apply | Total consensus round duration in milliseconds |

Jaeger query: Tag xrpl.consensus.mode=proposing to find rounds where node was proposing.

Prometheus label: xrpl_consensus_mode (used as SpanMetrics dimension).

Ledger Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.ledger.seq | int64 | ledger.build, ledger.validate, ledger.store, tx.apply | Ledger sequence number |
| xrpl.ledger.validations | int64 | ledger.validate | Number of validations received for this ledger |
| xrpl.ledger.tx_count | int64 | ledger.build, tx.apply | Transactions in the ledger |
| xrpl.ledger.tx_failed | int64 | ledger.build, tx.apply | Failed transactions in the ledger |

Jaeger query: Tag xrpl.ledger.seq=12345 to find all spans for a specific ledger.

Peer Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.peer.id | int64 | tx.receive, peer.proposal.receive, peer.validation.receive | Peer identifier |
| xrpl.peer.proposal.trusted | boolean | peer.proposal.receive | Whether the proposal came from a trusted validator |
| xrpl.peer.validation.trusted | boolean | peer.validation.receive | Whether the validation came from a trusted validator |

Prometheus labels: xrpl_peer_proposal_trusted, xrpl_peer_validation_trusted (SpanMetrics dimensions).


1.3 SpanMetrics — Derived Prometheus Metrics

See also: 01-architecture-analysis.md §1.8.2 for how span-derived metrics map to operational insights.

The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.

| Prometheus Metric | Type | Description |
|---|---|---|
| traces_span_metrics_calls_total | Counter | Total span invocations |
| traces_span_metrics_duration_milliseconds_bucket | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) |
| traces_span_metrics_duration_milliseconds_count | Histogram | Observation count |
| traces_span_metrics_duration_milliseconds_sum | Histogram | Cumulative latency |

Standard labels on every metric: span_name, status_code, service_name, span_kind

Additional dimension labels (configured in otel-collector-config.yaml):

| Span Attribute | Prometheus Label | Applies To |
|---|---|---|
| xrpl.rpc.command | xrpl_rpc_command | rpc.command.* |
| xrpl.rpc.status | xrpl_rpc_status | rpc.command.* |
| xrpl.consensus.mode | xrpl_consensus_mode | consensus.ledger_close |
| xrpl.tx.local | xrpl_tx_local | tx.process |
| xrpl.peer.proposal.trusted | xrpl_peer_proposal_trusted | peer.proposal.receive |
| xrpl.peer.validation.trusted | xrpl_peer_validation_trusted | peer.validation.receive |
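For orientation, the dimensions above would appear in the collector config roughly as follows. This is a hedged sketch, not the project's actual otel-collector-config.yaml; exact keys vary by collector version, so verify against the spanmetrics connector documentation.

```yaml
# Hypothetical fragment of otel-collector-config.yaml (illustrative only).
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 5000ms]
    dimensions:
      - name: xrpl.rpc.command
      - name: xrpl.rpc.status
      - name: xrpl.consensus.mode
      - name: xrpl.tx.local
      - name: xrpl.peer.proposal.trusted
      - name: xrpl.peer.validation.trusted
```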

Where to query: Prometheus → traces_span_metrics_calls_total{span_name="rpc.command.server_info"}


2. System Metrics (beast::insight — OTel native)

See also: 02-design-decisions.md for the beast::insight coexistence design. 06-implementation-phases.md for the Phase 6/7 metric inventory.

Migration complete: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The beast::insight::Collector interface and all metric names are preserved — only the wire protocol changed. [insight] server=statsd remains as a fallback.

These are system-level metrics emitted by rippled's beast::insight framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.

Configuration

# Recommended: native OTel metrics via OTLP/HTTP
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled

Fallback (StatsD):

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled

2.1 Gauges

| Prometheus Metric | Source File | Description | Typical Range |
|---|---|---|---|
| rippled_LedgerMaster_Validated_Ledger_Age | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) |
| rippled_LedgerMaster_Published_Ledger_Age | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) |
| rippled_State_Accounting_Disconnected_duration | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| rippled_State_Accounting_Connected_duration | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| rippled_State_Accounting_Syncing_duration | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| rippled_State_Accounting_Tracking_duration | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| rippled_State_Accounting_Full_duration | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| rippled_State_Accounting_Disconnected_transitions | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| rippled_State_Accounting_Connected_transitions | NetworkOPs.cpp | Count of transitions to Connected | Low |
| rippled_State_Accounting_Syncing_transitions | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| rippled_State_Accounting_Tracking_transitions | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| rippled_State_Accounting_Full_transitions | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| rippled_Peer_Finder_Active_Inbound_Peers | PeerfinderManager.cpp | Active inbound peer connections | 0–85 |
| rippled_Peer_Finder_Active_Outbound_Peers | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| rippled_Overlay_Peer_Disconnects | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| rippled_job_count | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |

Grafana dashboard: Node Health (System Metrics) (rippled-system-node-health)

2.2 Counters

| Prometheus Metric | Source File | Description |
|---|---|---|
| rippled_rpc_requests | ServerHandler.cpp | Total RPC requests received |
| rippled_ledger_fetches | InboundLedgers.cpp | Inbound ledger fetch attempts |
| rippled_ledger_history_mismatch | LedgerHistory.cpp | Ledger hash mismatches detected |
| rippled_warn | Logic.h | Resource manager warnings issued |
| rippled_drop | Logic.h | Resource manager drops (connections rejected) |

Note: With server=otel, rippled_warn and rippled_drop are properly exported as OTel Counter instruments. The previous StatsD |m type limitation no longer applies.

Grafana dashboard: RPC & Pathfinding (System Metrics) (rippled-system-rpc)

2.3 Histograms (Event timers)

| Prometheus Metric | Source File | Unit | Description |
|---|---|---|---|
| rippled_rpc_time | ServerHandler.cpp | ms | RPC response time distribution |
| rippled_rpc_size | ServerHandler.cpp | bytes | RPC response size distribution |
| rippled_ios_latency | Application.cpp | ms | I/O service loop latency |
| rippled_pathfind_fast | PathRequests.h | ms | Fast pathfinding duration |
| rippled_pathfind_full | PathRequests.h | ms | Full pathfinding duration |

Quantiles collected: 0th, 50th, 90th, 95th, 99th, and 100th percentiles.

Grafana dashboards: Node Health (ios_latency), RPC & Pathfinding (rpc_time, rpc_size, pathfind_*)

2.4 Overlay Traffic Metrics

For each of the 45+ overlay traffic categories (defined in TrafficCount.h), four gauges are emitted:

  • rippled_{category}_Bytes_In
  • rippled_{category}_Bytes_Out
  • rippled_{category}_Messages_In
  • rippled_{category}_Messages_Out

Key categories:

| Category | Description |
|---|---|
| total | All traffic aggregated |
| overhead / overhead_overlay | Protocol overhead |
| transactions / transactions_duplicate | Transaction relay |
| proposals / proposals_untrusted / proposals_duplicate | Consensus proposals |
| validations / validations_untrusted / validations_duplicate | Consensus validations |
| ledger_data_get / ledger_data_share | Ledger data exchange |
| ledger_data_Transaction_Node_get/share | Transaction node data |
| ledger_data_Account_State_Node_get/share | Account state node data |
| ledger_data_Transaction_Set_candidate_get/share | Transaction set candidates |
| getObject / haveTxSet / ledgerData | Object requests |
| ping / status | Keepalive and status |
| set_get | Set requests |
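Each category expands to its four gauge names mechanically, following the pattern listed above. A small sketch of that expansion:

```python
def overlay_metric_names(category: str, prefix: str = "rippled") -> list[str]:
    """Expand one overlay traffic category into its four gauge names,
    per the rippled_{category}_{direction} pattern."""
    return [
        f"{prefix}_{category}_{direction}"
        for direction in ("Bytes_In", "Bytes_Out", "Messages_In", "Messages_Out")
    ]
```

For instance, the `total` category yields rippled_total_Bytes_In, rippled_total_Bytes_Out, rippled_total_Messages_In, and rippled_total_Messages_Out.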

Grafana dashboards: Network Traffic (rippled-system-network), Overlay Traffic Detail (rippled-system-overlay-detail), Ledger Data & Sync (rippled-system-ledger-sync)


3. Grafana Dashboard Reference

See also: 05-configuration-reference.md §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples.

3.1 Span-Derived Dashboards (5)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| RPC Performance | rippled-rpc-perf | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | rippled-transactions | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | rippled-consensus | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | rippled-ledger-ops | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | rippled-peer-net | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |

3.2 System Metrics Dashboards (5)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Node Health | rippled-system-node-health | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | rippled-system-network | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | rippled-system-rpc | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | rippled-system-overlay-detail | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | rippled-system-ledger-sync | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |

3.3 Accessing the Dashboards

  1. Open Grafana at http://localhost:3000
  2. Navigate to Dashboards → rippled folder
  3. All 10 dashboards are auto-provisioned from docker/telemetry/grafana/dashboards/

4. Jaeger Trace Search Guide

See also: 08-appendix.md §8.2 for span hierarchy visualizations. 05-configuration-reference.md §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger.

Finding Traces by Type

| What to Find | Jaeger Search Parameters |
|---|---|
| All RPC calls | Service: rippled, Operation: rpc.request |
| Specific RPC command | Operation: rpc.command.server_info (or any command name) |
| Slow RPC calls | Operation: rpc.command.*, Min Duration: 100ms |
| Failed RPC calls | Tag: xrpl.rpc.status=error |
| Specific transaction | Tag: xrpl.tx.hash=<hex_hash> |
| Local transactions only | Tag: xrpl.tx.local=true |
| Consensus rounds | Operation: consensus.accept |
| Rounds by mode | Tag: xrpl.consensus.mode=proposing |
| Specific ledger | Tag: xrpl.ledger.seq=12345 |
| Peer proposals (trusted) | Tag: xrpl.peer.proposal.trusted=true |

Trace Structure

A typical RPC trace shows the span hierarchy:

rpc.request (ServerHandler)
  └── rpc.process (ServerHandler)
       └── rpc.command.server_info (RPCHandler)

A consensus round produces independent spans (not parent-child):

consensus.ledger_close        (close event)
consensus.proposal.send       (broadcast proposal)
ledger.build                  (build new ledger)
  └── tx.apply                (apply transaction set)
consensus.accept              (accept result)
consensus.validation.send     (send validation)
ledger.validate               (promote to validated)
ledger.store                  (persist to DB)
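These hierarchies can be reconstructed programmatically from (span_id, parent_id) pairs, the shape most trace API responses reduce to. A minimal sketch (not the shape of any specific Jaeger/Tempo client):

```python
from collections import defaultdict

def build_span_tree(spans):
    """Render an indented span hierarchy from (span_id, parent_id, name)
    tuples; root spans have parent_id None."""
    children = defaultdict(list)
    names = {}
    for span_id, parent_id, name in spans:
        names[span_id] = name
        children[parent_id].append(span_id)

    def render(span_id, depth=0):
        lines = ["  " * depth + names[span_id]]
        for child in children.get(span_id, []):
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in children.get(None, []) for line in render(root)]
```

Applied to the RPC example above, this reproduces the rpc.request → rpc.process → rpc.command.server_info nesting.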

5. Prometheus Query Examples

See also: 05-configuration-reference.md §5.8.7 for correlating Prometheus system metrics with trace-derived metrics.

Span-Derived Metrics

# RPC request rate by command (last 5 minutes)
sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))

# RPC p95 latency by command
histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))

# Consensus round duration p95
histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m])))

# Transaction processing rate (local vs relay)
sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))

# Trusted vs untrusted proposal rate
sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
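To make the p95 queries above less opaque: histogram_quantile estimates a quantile by finding the cumulative bucket containing the target rank and interpolating linearly within it. A simplified sketch of that computation (Prometheus's actual implementation handles more edge cases):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count) pairs
    ending with (float('inf'), total), as Prometheus histograms expose.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls in the +Inf bucket: return last finite bound
                return prev_bound
            # Linear interpolation inside the bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```

With the SpanMetrics bucket bounds (1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms), resolution is limited by bucket width, which is why wide buckets flatten p95 estimates.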

System Metrics (beast::insight)

# Validated ledger age (should be < 10s)
rippled_LedgerMaster_Validated_Ledger_Age

# Active peer count
rippled_Peer_Finder_Active_Inbound_Peers + rippled_Peer_Finder_Active_Outbound_Peers

# RPC response time p95
histogram_quantile(0.95, rippled_rpc_time_bucket)

# Total network bytes in (rate)
rate(rippled_total_Bytes_In[5m])

# Operating mode (should be "Full" after startup)
rippled_State_Accounting_Full_duration
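Any of these expressions can also be run programmatically against the Prometheus HTTP API (port :9090 per the stack diagram). A standard-library-only sketch; the endpoint path is the stable `/api/v1/query` instant-query API:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # per the stack diagram above

def query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def instant_query(expr: str, base_url: str = PROMETHEUS_URL):
    """Run an instant query and return the result vector (list of samples)."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: flag a stale validated ledger (> 10s, per the gauge table above):
# for sample in instant_query("rippled_LedgerMaster_Validated_Ledger_Age"):
#     if float(sample["value"][1]) > 10:
#         print("stale validated ledger")
```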

5a. Log-Trace Correlation (Phase 8)

Plan details: 06-implementation-phases.md §6.8.1 (motivation, architecture, Mermaid diagrams) | Task breakdown: Phase8_taskList.md (per-task implementation details)

Phase 8 injects OTel trace context into rippled's Logs::format() output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:

Log Format

<timestamp> <partition>:<severity> trace_id=<32hex> span_id=<16hex> <message>

Example:

2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42

  • trace_id=<hex32> — 32-character lowercase hex trace identifier. Links to the distributed trace in Tempo/Jaeger.
  • span_id=<hex16> — 16-character lowercase hex span identifier. Identifies the specific span within the trace.
  • Only present when the log is emitted within an active OTel span. Log lines outside of traced code paths have no trace context fields.
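This format is regular enough to parse with a single regex. The sketch below is an illustrative equivalent of the collector's regex_parser pattern, not the actual operator config (which lives in the collector configuration); the trace fields are optional so untraced lines still match:

```python
import re

# Illustrative pattern for the log format documented above.
LOG_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<partition>\w+):(?P<severity>TRC|DBG|NFO|WRN|ERR|FTL)\s+"
    r"(?:trace_id=(?P<trace_id>[0-9a-f]{32})\s+span_id=(?P<span_id>[0-9a-f]{16})\s+)?"
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return the structured fields of one debug.log line, or None."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

Untraced lines simply yield trace_id and span_id of None, mirroring the "optional" fields in the filelog table below.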

Implementation

The trace context injection is implemented in Logs::format() (src/libxrpl/basics/Log.cpp), guarded by #ifdef XRPL_ENABLE_TELEMETRY. It reads the current span from OTel's thread-local runtime context via opentelemetry::trace::GetSpan() and opentelemetry::context::RuntimeContext::GetCurrent(). Both calls are lock-free thread-local reads measured at <10ns per call.

Log Ingestion Pipeline

rippled debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki

The OTel Collector's filelog receiver tails debug.log files and uses a regex_parser operator to extract structured fields:

| Field | Type | Description |
|---|---|---|
| timestamp | datetime | Log timestamp |
| partition | string | Log partition (e.g., LedgerMaster, PeerImp) |
| severity | string | Severity code (TRC, DBG, NFO, WRN, ERR, FTL) |
| trace_id | string | 32-hex trace identifier (optional) |
| span_id | string | 16-hex span identifier (optional) |
| message | string | Log message body |

Grafana Correlation

Bidirectional linking between logs and traces is configured via Grafana datasource provisioning:

  • Tempo -> Loki (tracesToLogs): Clicking "Logs for this trace" on a Tempo trace view filters Loki logs by trace_id, showing all log lines from that trace.
  • Loki -> Tempo (derivedFields): A regex-based derived field on the Loki datasource extracts trace_id from log lines and renders it as a clickable link to the corresponding trace in Tempo.

Loki Backend

Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries from the OTel Collector's loki exporter via the push API at http://loki:3100/loki/api/v1/push.

LogQL Query Examples

# Find all logs for a specific trace
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"

# Error logs with trace context
{job="rippled"} |= "ERR" |= "trace_id="

# Logs from a specific partition with trace context
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""

# Count traced log lines over time
count_over_time({job="rippled"} |= "trace_id=" [5m])

5b. Future: Internal Metric Gap Fill (Phase 9)

Status: Planned, not yet implemented. Plan details: 06-implementation-phases.md §6.8.2 (motivation, architecture, third-party context) | Task breakdown: Phase9_taskList.md (per-task implementation details)

Phase 9 adds time-series export for 50+ metrics that already exist inside rippled but are not yet exported. It uses a hybrid approach: beast::insight extensions for NodeStore I/O, and OTel ObservableGauge async callbacks for the new categories.

New Metric Categories

NodeStore I/O (via beast::insight)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_nodestore_reads_total | Gauge | Cumulative read operations |
| rippled_nodestore_reads_hit | Gauge | Cache-served reads |
| rippled_nodestore_writes | Gauge | Cumulative write operations |
| rippled_nodestore_written_bytes | Gauge | Cumulative bytes written |
| rippled_nodestore_read_bytes | Gauge | Cumulative bytes read |
| rippled_nodestore_read_duration_us | Gauge | Cumulative read time (microseconds) |
| rippled_nodestore_write_load | Gauge | Current write load score |
| rippled_nodestore_read_queue | Gauge | Items in read queue |

Cache Hit Rates (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_cache_SLE_hit_rate | Gauge | SLE cache hit rate (0.0-1.0) |
| rippled_cache_ledger_hit_rate | Gauge | Ledger object cache hit rate |
| rippled_cache_AL_hit_rate | Gauge | AcceptedLedger cache hit rate |
| rippled_cache_treenode_size | Gauge | SHAMap TreeNode cache size (entries) |
| rippled_cache_fullbelow_size | Gauge | FullBelow cache size |

Transaction Queue (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_txq_count | Gauge | Current transactions in queue |
| rippled_txq_max_size | Gauge | Maximum queue capacity |
| rippled_txq_in_ledger | Gauge | Transactions in open ledger |
| rippled_txq_per_ledger | Gauge | Expected transactions per ledger |
| rippled_txq_open_ledger_fee_level | Gauge | Open ledger fee escalation level |
| rippled_txq_med_fee_level | Gauge | Median fee level in queue |
| rippled_txq_reference_fee_level | Gauge | Reference fee level |
| rippled_txq_min_processing_fee_level | Gauge | Minimum fee to get processed |

PerfLog Per-RPC Method (via OTel Metrics SDK)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_rpc_method_started_total | Counter | method="<name>" | RPC calls started |
| rippled_rpc_method_finished_total | Counter | method="<name>" | RPC calls completed |
| rippled_rpc_method_errored_total | Counter | method="<name>" | RPC calls errored |
| rippled_rpc_method_duration_us_bucket | Histogram | method="<name>" | Execution time distribution |

PerfLog Per-Job Type (via OTel Metrics SDK)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_job_queued_total | Counter | job_type="<name>" | Jobs queued |
| rippled_job_started_total | Counter | job_type="<name>" | Jobs started |
| rippled_job_finished_total | Counter | job_type="<name>" | Jobs completed |
| rippled_job_queued_duration_us_bucket | Histogram | job_type="<name>" | Queue wait time |
| rippled_job_running_duration_us_bucket | Histogram | job_type="<name>" | Execution time |

Counted Object Instances (via OTel MetricsRegistry)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_object_count | Gauge | type="<name>" | Live instances of internal type |

Tracked types: Transaction, Ledger, NodeObject, STTx, STLedgerEntry, InboundLedger, Pathfinder, PathRequest, HashRouterEntry

Fee Escalation & Load Factors (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_load_factor | Gauge | Combined transaction cost multiplier |
| rippled_load_factor_server | Gauge | Server + cluster + network load |
| rippled_load_factor_local | Gauge | Local server load only |
| rippled_load_factor_net | Gauge | Network-wide load estimate |
| rippled_load_factor_cluster | Gauge | Cluster peer load |
| rippled_load_factor_fee_escalation | Gauge | Open ledger fee escalation |
| rippled_load_factor_fee_queue | Gauge | Queue entry fee level |

New Grafana Dashboards (Phase 9)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | rippled-fee-market | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | rippled-job-queue | Prometheus | Per-job rates, queue wait times, execution times, queue depth |

5c. Synthetic Workload Generation & Telemetry Validation (Phase 10)

Plan details: 06-implementation-phases.md §6.8.3 (motivation, architecture) | Task breakdown: Phase10_taskList.md (per-task implementation details) | Tools: docker/telemetry/workload/ (RPC load generator, transaction submitter, validation suite, benchmarks)

Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify all spans, metrics, dashboards, and log-trace correlation work end-to-end. Includes a benchmark suite comparing telemetry-ON vs telemetry-OFF overhead.
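The RPC load generator injects W3C traceparent headers so that client-originated requests join a distributed trace. Generating a valid header value is straightforward; a sketch per the W3C Trace Context format (version-trace_id-span_id-flags), not the generator's actual code:

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header value: version-trace_id-span_id-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars; spec forbids all-zeros
    span_id = secrets.token_hex(8)    # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

The value is sent as a `traceparent` header (or the equivalent field on a WebSocket request) so rippled's server-side spans attach to the client's trace.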

Running the Validation Suite

# Full end-to-end validation (start cluster, generate load, validate):
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld

# Validation only (assumes stack and cluster are already running):
python3 docker/telemetry/workload/validate_telemetry.py --report /tmp/report.json

# Performance benchmark (baseline vs telemetry):
docker/telemetry/workload/benchmark.sh --xrpld .build/xrpld --duration 300
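At its core, the span check reduces to a set comparison between expected_spans.json and the span names a backend reports. A hypothetical sketch (function names are illustrative, not validate_telemetry.py's actual API); dynamic names like rpc.command.<name> need a prefix match rather than equality:

```python
def missing_spans(expected, found):
    """Span names listed in the expected set but absent from the backend."""
    return set(expected) - set(found)

def missing_dynamic(expected_prefixes, found):
    """For dynamic span names (e.g. 'rpc.command.'), require at least
    one reported span starting with each prefix."""
    return {
        prefix
        for prefix in expected_prefixes
        if not any(name.startswith(prefix) for name in found)
    }
```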

Validated Telemetry Inventory

| Category | Expected Count | Validation Method | Config File |
|---|---|---|---|
| Trace spans | 17 | Jaeger/Tempo API query | expected_spans.json |
| Span attributes | 22 | Per-span attribute assertion | expected_spans.json |
| StatsD metrics | 255+ | Prometheus query | expected_metrics.json |
| Phase 9 metrics | 50+ | Prometheus query | expected_metrics.json |
| SpanMetrics RED | 4 per span | Prometheus query | expected_metrics.json |
| Grafana dashboards | 10 | Dashboard API "no data" check | expected_metrics.json |
| Log-trace links | Present | Loki query + Tempo reverse check | — |
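
The span-presence check can be sketched as below; `validate_telemetry.py` is the authoritative implementation, and the endpoint, port, and span names here are assumptions based on Jaeger's standard HTTP query API:

```python
import json
import urllib.request

JAEGER_URL = "http://localhost:16686"  # assumed default Jaeger query port


def fetch_operations(service: str) -> set[str]:
    """Return the span/operation names Jaeger has recorded for a service."""
    url = f"{JAEGER_URL}/api/services/{service}/operations"
    with urllib.request.urlopen(url) as resp:
        return set(json.load(resp)["data"])


def missing_spans(expected: set[str], found: set[str]) -> set[str]:
    """Span names listed in expected_spans.json but absent from the backend."""
    return expected - found


# Example with canned data (a real run would call fetch_operations("rippled")):
expected = {"rpc.server_info", "tx.apply", "consensus.round"}
found = {"rpc.server_info", "tx.apply"}
print(sorted(missing_spans(expected, found)))  # ['consensus.round']
```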

Performance Overhead Targets

| Metric | Target | Measurement Method |
|---|---|---|
| CPU overhead | < 3% | ps avg CPU% baseline vs telemetry |
| Memory overhead | < 5 MB | ps peak RSS baseline vs telemetry |
| RPC p99 latency | < 2 ms impact | server_info round-trip timing |
| Throughput impact | < 5% | Ledger close rate comparison |
| Consensus impact | < 1% | Consensus round time p95 comparison |
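
The pass/fail arithmetic behind these targets is simple subtraction against fixed budgets. A minimal sketch covering three of the five targets (function names are hypothetical; `benchmark.sh` is the actual implementation):

```python
# Thresholds from the table above; a run fails if any delta exceeds its budget.
THRESHOLDS = {
    "cpu_pct": 3.0,     # < 3% average CPU overhead
    "memory_mb": 5.0,   # < 5 MB peak RSS overhead
    "rpc_p99_ms": 2.0,  # < 2 ms added server_info round-trip p99
}


def overhead(baseline: dict, telemetry: dict) -> dict:
    """Absolute deltas between the telemetry-ON and telemetry-OFF runs."""
    return {k: telemetry[k] - baseline[k] for k in baseline}


def within_budget(deltas: dict) -> bool:
    return all(deltas[k] < THRESHOLDS[k] for k in THRESHOLDS)


baseline = {"cpu_pct": 41.0, "memory_mb": 812.0, "rpc_p99_ms": 9.5}
telemetry = {"cpu_pct": 43.1, "memory_mb": 815.2, "rpc_p99_ms": 10.8}
deltas = overhead(baseline, telemetry)
print(deltas, within_budget(deltas))
```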

5d. Future: Third-Party Data Collection Pipelines (Phase 11)

Status: Planned, not yet implemented.

- Plan details: 06-implementation-phases.md §6.8.4 — motivation, architecture, consumer gap analysis
- Task breakdown: Phase11_taskList.md — per-task implementation details

Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports xrpl_* metrics for external consumers. No rippled code changes.
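
The receiver's core loop is poll-and-map: call an admin RPC, then flatten the JSON response into gauge samples. A Python sketch of the mapping step (the real receiver is Go; the state ordering follows the table below, and the field paths are assumptions based on typical `server_info` output):

```python
# Assumed state ordering from the node-health table: 0=disconnected ... 5=proposing.
SERVER_STATES = ["disconnected", "connected", "syncing", "tracking", "full", "proposing"]


def map_server_info(info: dict) -> dict:
    """Flatten a server_info 'info' object into xrpl_* gauge samples."""
    return {
        "xrpl_server_state": SERVER_STATES.index(info["server_state"]),
        "xrpl_uptime_seconds": info["uptime"],
        "xrpl_io_latency_ms": info["io_latency_ms"],
        "xrpl_peers_count": info["peers"],
        "xrpl_validated_ledger_seq": info["validated_ledger"]["seq"],
        "xrpl_load_factor": info["load_factor"],
        # amendment_blocked is only present in the response when the node is blocked
        "xrpl_amendment_blocked": int(info.get("amendment_blocked", False)),
    }


sample = {
    "server_state": "full", "uptime": 86400, "io_latency_ms": 1,
    "peers": 21, "validated_ledger": {"seq": 92000000}, "load_factor": 1,
}
print(map_server_info(sample)["xrpl_server_state"])  # 4
```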

Exported Metrics (via Custom OTel Collector Receiver)

Node Health (from server_info)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_server_state | Gauge | Operating mode (0=disconnected ... 5=proposing) |
| xrpl_server_state_duration_seconds | Gauge | Seconds in current state |
| xrpl_uptime_seconds | Gauge | Consecutive seconds running |
| xrpl_io_latency_ms | Gauge | I/O subsystem latency |
| xrpl_amendment_blocked | Gauge | 1 if amendment-blocked, 0 otherwise |
| xrpl_peers_count | Gauge | Connected peers |
| xrpl_validated_ledger_seq | Gauge | Latest validated ledger sequence |
| xrpl_validated_ledger_age_seconds | Gauge | Seconds since last validated close |
| xrpl_last_close_proposers | Gauge | Proposers in last consensus round |
| xrpl_last_close_converge_time_seconds | Gauge | Last consensus round duration |
| xrpl_load_factor | Gauge | Transaction cost multiplier |
| xrpl_state_duration_seconds | Gauge | Per-state duration (state label) |
| xrpl_state_transitions_total | Gauge | Per-state transition count (state label) |

Peer Topology (from peers)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_peers_inbound_count | Gauge | Inbound peer connections |
| xrpl_peers_outbound_count | Gauge | Outbound peer connections |
| xrpl_peer_latency_p50_ms | Gauge | Median peer latency |
| xrpl_peer_latency_p95_ms | Gauge | p95 peer latency |
| xrpl_peer_version_count | Gauge | Peers per version (version label) |
| xrpl_peer_diverged_count | Gauge | Peers with diverged tracking status |

Validator & Amendment (from validators, feature)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_trusted_validators_count | Gauge | UNL validator count |
| xrpl_amendment_enabled_count | Gauge | Enabled amendments |
| xrpl_amendment_majority_count | Gauge | Amendments with majority |
| xrpl_amendment_unsupported_majority | Gauge | 1 if an unsupported amendment has majority |
| xrpl_validator_list_active | Gauge | 1 if validator list is active |

Fee Market (from fee)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_fee_open_ledger_fee_drops | Gauge | Minimum fee for open ledger inclusion |
| xrpl_fee_median_fee_drops | Gauge | Median fee level |
| xrpl_fee_queue_size | Gauge | Current transaction queue depth |
| xrpl_fee_current_ledger_size | Gauge | Transactions in current open ledger |

DEX & AMM (optional, from book_offers, amm_info)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| xrpl_amm_tvl_drops | Gauge | `pool="<id>"` | Total value locked |
| xrpl_amm_trading_fee | Gauge | `pool="<id>"` | Pool trading fee (bps) |
| xrpl_orderbook_bid_depth | Gauge | `pair="<base/quote>"` | Total bid volume |
| xrpl_orderbook_ask_depth | Gauge | `pair="<base/quote>"` | Total ask volume |
| xrpl_orderbook_spread | Gauge | `pair="<base/quote>"` | Best bid-ask spread |

Phase 9: OTel SDK-Exported Metrics (MetricsRegistry)

Phase 9 introduces the MetricsRegistry class (src/xrpld/telemetry/MetricsRegistry.h/.cpp), which registers metrics directly with the OpenTelemetry Metrics SDK. These are exported via OTLP/HTTP to the OTel Collector and scraped by Prometheus.
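
Observable gauges are pull-based: the SDK invokes a registered callback at each export interval rather than having instrumented code push values. The pattern can be sketched as below (purely illustrative Python; MetricsRegistry itself uses the C++ SDK, and all names here are made up):

```python
# Minimal illustration of the observable-gauge pattern: callbacks are
# registered once, then polled at collection time.
class MiniRegistry:
    def __init__(self):
        self._callbacks = {}  # instrument name -> zero-arg callable

    def register_observable_gauge(self, name, callback):
        self._callbacks[name] = callback

    def collect(self):
        """What the SDK does each export interval: invoke every callback."""
        return {name: cb() for name, cb in self._callbacks.items()}


# A gauge backed by live application state (here, a plain dict):
node_reads = {"total": 0}
registry = MiniRegistry()
registry.register_observable_gauge(
    "rippled_nodestore_state.node_reads_total",
    lambda: node_reads["total"],
)

node_reads["total"] += 7
print(registry.collect())  # values reflect state at collection time
```

Because collection reads live state, the exported value is always current at scrape time, with no per-event instrumentation cost on the hot path.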

NodeStore I/O (Observable Gauge — nodestore_state)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_nodestore_state{metric="node_reads_total"}` | Gauge | metric | Cumulative NodeStore read operations |
| `rippled_nodestore_state{metric="node_reads_hit"}` | Gauge | metric | Reads served from cache |
| `rippled_nodestore_state{metric="node_writes"}` | Gauge | metric | Cumulative write operations |
| `rippled_nodestore_state{metric="node_written_bytes"}` | Gauge | metric | Cumulative bytes written |
| `rippled_nodestore_state{metric="node_read_bytes"}` | Gauge | metric | Cumulative bytes read |
| `rippled_nodestore_state{metric="write_load"}` | Gauge | metric | Current write load score |
| `rippled_nodestore_state{metric="read_queue"}` | Gauge | metric | Items in read prefetch queue |

Cache Hit Rates & Sizes (Observable Gauge — cache_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_cache_metrics{metric="SLE_hit_rate"}` | Gauge | metric | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_metrics{metric="ledger_hit_rate"}` | Gauge | metric | Ledger cache hit rate |
| `rippled_cache_metrics{metric="AL_hit_rate"}` | Gauge | metric | AcceptedLedger cache hit rate |
| `rippled_cache_metrics{metric="treenode_cache_size"}` | Gauge | metric | SHAMap TreeNode cache entries |
| `rippled_cache_metrics{metric="treenode_track_size"}` | Gauge | metric | Tracked tree nodes |
| `rippled_cache_metrics{metric="fullbelow_size"}` | Gauge | metric | FullBelow cache entries |

Transaction Queue (Observable Gauge — txq_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_txq_metrics{metric="txq_count"}` | Gauge | metric | Transactions currently in queue |
| `rippled_txq_metrics{metric="txq_max_size"}` | Gauge | metric | Maximum queue capacity |
| `rippled_txq_metrics{metric="txq_in_ledger"}` | Gauge | metric | Transactions in open ledger |
| `rippled_txq_metrics{metric="txq_per_ledger"}` | Gauge | metric | Expected transactions per ledger |
| `rippled_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | metric | Reference fee level |
| `rippled_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | metric | Minimum fee to get processed |
| `rippled_txq_metrics{metric="txq_med_fee_level"}` | Gauge | metric | Median fee level in queue |
| `rippled_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | metric | Open ledger fee escalation level |

Per-RPC Method Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_rpc_method_started_total | Counter | `method="<name>"` | RPC calls started |
| rippled_rpc_method_finished_total | Counter | `method="<name>"` | RPC calls completed successfully |
| rippled_rpc_method_errored_total | Counter | `method="<name>"` | RPC calls that errored |
| rippled_rpc_method_duration_us | Histogram | `method="<name>"` | Execution time distribution (us) |

Per-Job-Type Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_job_queued_total | Counter | `job_type="<name>"` | Jobs enqueued |
| rippled_job_started_total | Counter | `job_type="<name>"` | Jobs started |
| rippled_job_finished_total | Counter | `job_type="<name>"` | Jobs completed |
| rippled_job_queued_duration_us | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| rippled_job_running_duration_us | Histogram | `job_type="<name>"` | Execution time distribution (us) |

Counted Object Instances (Observable Gauge — object_count)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_object_count{type="Transaction"}` | Gauge | `type="<name>"` | Live Transaction objects |
| `rippled_object_count{type="Ledger"}` | Gauge | `type="<name>"` | Live Ledger objects |
| `rippled_object_count{type="NodeObject"}` | Gauge | `type="<name>"` | Live NodeObject instances |
| `rippled_object_count{type="STTx"}` | Gauge | `type="<name>"` | Serialized transaction objects |
| `rippled_object_count{type="STLedgerEntry"}` | Gauge | `type="<name>"` | Serialized ledger entries |
| `rippled_object_count{type="InboundLedger"}` | Gauge | `type="<name>"` | Ledgers being fetched |
| `rippled_object_count{type="Pathfinder"}` | Gauge | `type="<name>"` | Active pathfinding operations |
| `rippled_object_count{type="PathRequest"}` | Gauge | `type="<name>"` | Active path requests |
| `rippled_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries |

Load Factor Breakdown (Observable Gauge — load_factor_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_load_factor_metrics{metric="load_factor"}` | Gauge | metric | Combined transaction cost multiplier |
| `rippled_load_factor_metrics{metric="load_factor_server"}` | Gauge | metric | Server + cluster + network contribution |
| `rippled_load_factor_metrics{metric="load_factor_local"}` | Gauge | metric | Local server load only |
| `rippled_load_factor_metrics{metric="load_factor_net"}` | Gauge | metric | Network-wide load estimate |
| `rippled_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | metric | Cluster peer load |
| `rippled_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | metric | Open ledger fee escalation |
| `rippled_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | metric | Queue entry fee level |

Prometheus Query Examples (Phase 9)

```promql
# NodeStore cache hit ratio
rippled_nodestore_state{metric="node_reads_hit"} / rippled_nodestore_state{metric="node_reads_total"}

# RPC error rate for server_info
rate(rippled_rpc_method_errored_total{method="server_info"}[5m])

# Job queue wait time p95
histogram_quantile(0.95, sum by (le) (rate(rippled_job_queued_duration_us_bucket[5m])))

# TxQ utilization percentage
rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"}

# High load factor alert candidate
rippled_load_factor_metrics{metric="load_factor"} > 5
```
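
These expressions can also be run programmatically against Prometheus's HTTP API, which is how the Phase 10 validation suite checks metric presence. A standard-library sketch (the URL and port are the dev stack's assumed defaults):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed Prometheus port in the dev stack


def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    if body["status"] != "success":
        raise RuntimeError(body.get("error", "query failed"))
    return body["data"]["result"]


def first_value(result: list) -> float:
    """Extract the sample value from the first series, if any."""
    return float(result[0]["value"][1]) if result else float("nan")


# e.g.:
# first_value(instant_query('rippled_load_factor_metrics{metric="load_factor"}'))
```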

New Grafana Dashboards (Phase 9)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | rippled-fee-market | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | rippled-job-queue | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | rippled-rpc-perf | Prometheus | Per-method call rates, error rates, latency distributions |

Updated Grafana Dashboards (Phase 9)

| Dashboard | UID | New Panels Added |
|---|---|---|
| Node Health (StatsD) | rippled-statsd-node-health | NodeStore I/O, cache hit rates, object instance counts |

New Grafana Dashboards (Phase 11)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Validator Health | rippled-validator-health | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | rippled-network-topology | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | rippled-fee-market-external | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | rippled-dex-amm | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |

Prometheus Alerting Rules (Phase 11)

| Alert Name | Severity | Condition | For |
|---|---|---|---|
| XRPLServerNotFull | Critical | xrpl_server_state < 4 | 15m |
| XRPLAmendmentBlocked | Critical | xrpl_amendment_blocked == 1 | 1m |
| XRPLNoPeers | Critical | xrpl_peers_count == 0 | 5m |
| XRPLLedgerStale | Critical | xrpl_validated_ledger_age_seconds > 120 | 2m |
| XRPLHighIOLatency | Critical | xrpl_io_latency_ms > 100 | 5m |
| XRPLUnsupportedAmendmentMajority | Critical | xrpl_amendment_unsupported_majority == 1 | 1m |
| XRPLLowPeerCount | Warning | xrpl_peers_count < 10 | 15m |
| XRPLHighLoadFactor | Warning | xrpl_load_factor > 10 | 10m |
| XRPLSlowConsensus | Warning | xrpl_last_close_converge_time_seconds > 6 | 5m |
| XRPLValidatorListExpiring | Warning | (xrpl_validator_list_expiration_seconds - time()) < 86400 | 1h |
| XRPLStateFlapping | Warning | rate(xrpl_state_transitions_total{state="full"}[1h]) > 2 | 30m |
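
As a concrete sketch, the XRPLLedgerStale row above expressed in Prometheus rule-file syntax (group name, labels, and annotation wording are illustrative, not the shipped rule file):

```yaml
groups:
  - name: xrpl-node-health
    rules:
      - alert: XRPLLedgerStale
        expr: xrpl_validated_ledger_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validated ledger is stale on {{ $labels.instance }}"
          description: "No validated ledger close for over 120 seconds."
```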

6. Known Issues

| Issue | Impact | Status |
|---|---|---|
| warn and drop metrics use the non-standard StatsD `\|m` meter type | Metrics silently dropped by the OTel StatsD receiver | Phase 6 Task 6.1 — needs a `\|m` → `\|c` change in StatsDCollector.cpp |
| rippled_job_count may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| rippled_rpc_requests depends on [insight] config | Zero series if StatsD is not configured | Requires [insight] server=statsd in xrpld.cfg |
| Peer tracing disabled by default | No peer.* spans unless trace_peer=1 | Intentional — high volume on mainnet |

7. Privacy and Data Collection

The telemetry system is designed with privacy in mind:

- No private keys are ever included in spans or metrics
- No account balances or financial data are traced
- Transaction hashes are included (public on-ledger data), but not transaction contents
- Peer IDs are internal identifiers, not IP addresses
- All telemetry is opt-in — disabled by default at build time (-Dtelemetry=OFF)
- Sampling reduces data volume — sampling_ratio=0.01 is recommended for production
- Data stays local — the default stack sends data to localhost only

8. Configuration Quick Reference

Full reference: 05-configuration-reference.md §5.1 for all [telemetry] options with defaults, the config parser implementation, and collector YAML configurations (dev and production).

Minimal Setup (development)

```ini
[telemetry]
enabled=1

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```

Production Setup

```ini
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096

[insight]
server=statsd
address=otel-collector:8125
prefix=rippled
```

Trace Category Toggle

| Config Key | Default | Controls |
|---|---|---|
| trace_rpc | 1 | rpc.* spans |
| trace_transactions | 1 | tx.* spans |
| trace_consensus | 1 | consensus.* spans |
| trace_ledger | 1 | ledger.* spans |
| trace_peer | 0 | peer.* spans (high volume) |