Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:
Task 10.1 — Multi-node test harness:
- docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
Tempo, Prometheus, Loki, Grafana)
- generate-validator-keys.sh for automated key generation
- xrpld-validator.cfg.template for node configuration
Task 10.2 — RPC load generator:
- rpc_load_generator.py with WebSocket client, configurable rates,
realistic command distribution (40% health, 30% wallet, 15% explorer,
10% tx lookups, 5% DEX), W3C traceparent injection
Task 10.3 — Transaction submitter:
- tx_submitter.py with 10 transaction types (Payment, OfferCreate,
OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts
Task 10.4 — Telemetry validation suite:
- validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
log-trace correlation (Loki), dashboards (Grafana)
- expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
- expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)
Task 10.5 — Performance benchmark suite:
- benchmark.sh for baseline vs telemetry comparison
- collect_system_metrics.sh for CPU/memory/latency sampling
- Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus
Task 10.6 — CI integration:
- telemetry-validation.yml GitHub Actions workflow
- run-full-validation.sh orchestrator script
- Manual trigger + telemetry branch auto-trigger
Task 10.7 — Documentation:
- workload/README.md with quick start and tool reference
- Updated telemetry-runbook.md with validation and benchmark sections
- Updated 09-data-collection-reference.md with validation inventory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Observability Data Collection Reference
Audience: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.
Related docs: docs/telemetry-runbook.md (operator runbook with alerting and troubleshooting) | 03-implementation-strategy.md (code structure and performance optimization) | 04-code-samples.md (C++ instrumentation examples)
Data Flow Overview
graph LR
subgraph rippledNode["rippled Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
C["MetricsRegistry<br/>OTel SDK metrics<br/>(OTLP/HTTP exporter)"]
end
subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP<br/>(traces + metrics)"]
BP["Batch Processor<br/>timeout 1s, batch 100"]
SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
R1 --> BP
BP --> SM
end
subgraph backends["Trace Backends (choose one or both)"]
D["Jaeger :16686<br/>Trace search &<br/>visualization"]
T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
end
subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + system metrics"]
end
subgraph viz["Visualization"]
F["Grafana :3000<br/>13 dashboards"]
end
A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
B -->|"OTLP/HTTP :4318<br/>(gauges, counters, histograms)"| R1
C -->|"OTLP/HTTP :4318<br/>(counters, histograms,<br/>observable gauges)"| R1
BP -->|"OTLP/gRPC :4317"| D
BP -->|"OTLP/gRPC"| T
SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
R1 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* histograms"| E
E -->|"Prometheus<br/>data source"| F
D -->|"Jaeger<br/>data source"| F
T -->|"Tempo<br/>data source"| F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#4a90d9,color:#fff,stroke:#2a6db5
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style T fill:#e8953a,color:#000,stroke:#b5732a
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
There are two independent telemetry pipelines entering a single OTel Collector via the same OTLP receiver:
- OpenTelemetry Traces — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's OTLP Receiver. The Batch Processor groups spans (1s timeout, batch size 100) before forwarding to trace backends. The SpanMetrics Connector derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
- beast::insight OTel Metrics — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same OTLP Receiver. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; `server=statsd` remains available as a fallback.
Trace backends — The collector exports traces via OTLP/gRPC to one or both:
- Jaeger (development) — Provides trace search UI at :16686. Easy single-binary setup.
- Grafana Tempo (production) — Preferred for production. Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.
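Trace context flows between these components via the W3C `traceparent` header (the same header the workload tools inject into RPC requests). A minimal stdlib-only sketch of building and parsing one; the helper names are illustrative, not part of the rippled tooling:

```python
import re
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-span_id-flags."""
    trace_id = secrets.token_hex(16)  # 32 lowercase hex chars
    span_id = secrets.token_hex(8)    # 16 lowercase hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields; raise on malformed input."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return m.groupdict()
```

Sending this header on an HTTP RPC request lets the collector stitch the client-side call and the server-side span into a single trace.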
Further reading: 00-tracing-fundamentals.md for core OpenTelemetry concepts (traces, spans, context propagation, sampling). 07-observability-backends.md for production backend selection, collector placement, and sampling strategies.
1. OpenTelemetry Spans
1.1 Complete Span Inventory (17 spans)
See also: 02-design-decisions.md §2.3 for naming conventions and the full span catalog with rationale. 04-code-samples.md §4.6 for span flow diagrams.
RPC Spans
Controlled by trace_rpc=1 in [telemetry] config.
| Span Name | Parent | Source File | Description |
|---|---|---|---|
| `rpc.request` | — | ServerHandler.cpp | Top-level HTTP RPC request entry point |
| `rpc.process` | `rpc.request` | ServerHandler.cpp | RPC processing pipeline |
| `rpc.ws_message` | — | ServerHandler.cpp | WebSocket message handling |
| `rpc.command.<name>` | `rpc.process` | RPCHandler.cpp | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
Where to find: Jaeger → Service: rippled → Operation: rpc.request or rpc.command.*
Grafana dashboard: RPC Performance (rippled-rpc-perf)
Transaction Spans
Controlled by trace_transactions=1 in [telemetry] config.
| Span Name | Parent | Source File | Description |
|---|---|---|---|
| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
Where to find: Jaeger → Operation: tx.process or tx.receive
Grafana dashboard: Transaction Overview (rippled-transactions)
Consensus Spans
Controlled by trace_consensus=1 in [telemetry] config.
| Span Name | Parent | Source File | Description |
|---|---|---|---|
| `consensus.proposal.send` | — | RCLConsensus.cpp | Node broadcasts its transaction set proposal |
| `consensus.ledger_close` | — | RCLConsensus.cpp | Ledger close event triggered by consensus |
| `consensus.accept` | — | RCLConsensus.cpp | Consensus accepts a ledger (round complete) |
| `consensus.validation.send` | — | RCLConsensus.cpp | Validation message sent after ledger accepted |
| `consensus.accept.apply` | — | RCLConsensus.cpp | Ledger application with close time details |
Where to find: Jaeger → Operation: consensus.*
Grafana dashboard: Consensus Health (rippled-consensus)
Ledger Spans
Controlled by trace_ledger=1 in [telemetry] config.
| Span Name | Parent | Source File | Description |
|---|---|---|---|
| `ledger.build` | — | BuildLedger.cpp | Build new ledger from accepted transaction set |
| `ledger.validate` | — | LedgerMaster.cpp | Ledger promoted to validated status |
| `ledger.store` | — | LedgerMaster.cpp | Ledger stored to database/history |
Where to find: Jaeger → Operation: ledger.*
Grafana dashboard: Ledger Operations (rippled-ledger-ops)
Peer Spans
Controlled by trace_peer=1 in [telemetry] config. Disabled by default (high volume).
| Span Name | Parent | Source File | Description |
|---|---|---|---|
| `peer.proposal.receive` | — | PeerImp.cpp | Consensus proposal received from peer |
| `peer.validation.receive` | — | PeerImp.cpp | Validation message received from peer |
Where to find: Jaeger → Operation: peer.*
Grafana dashboard: Peer Network (rippled-peer-net)
1.2 Complete Attribute Inventory (22 attributes)
See also: 02-design-decisions.md §2.4.2 for attribute design rationale and privacy considerations.
Every span can carry key-value attributes that provide context for filtering and aggregation.
RPC Attributes
| Attribute | Type | Set On | Description |
|---|---|---|---|
| `xrpl.rpc.command` | string | `rpc.command.*` | RPC command name (e.g., server_info, ledger) |
| `xrpl.rpc.version` | int64 | `rpc.command.*` | API version number |
| `xrpl.rpc.role` | string | `rpc.command.*` | Caller role: "admin" or "user" |
| `xrpl.rpc.status` | string | `rpc.command.*` | Result: "success" or "error" |
| `xrpl.rpc.duration_ms` | int64 | `rpc.command.*` | Command execution time in milliseconds |
| `xrpl.rpc.error_message` | string | `rpc.command.*` | Error details (only set on failure) |
Jaeger query: Tag xrpl.rpc.command=server_info to find all server_info calls.
Prometheus label: xrpl_rpc_command (dots converted to underscores by SpanMetrics).
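The dot-to-underscore conversion is mechanical; a one-line sketch of the mapping (the helper name is mine, not part of the tooling):

```python
def attr_to_label(attr: str) -> str:
    """SpanMetrics exposes a span attribute as a Prometheus label,
    with dots converted to underscores."""
    return attr.replace(".", "_")

# e.g. filter the derived counter by command name:
selector = (
    f'traces_span_metrics_calls_total'
    f'{{{attr_to_label("xrpl.rpc.command")}="server_info"}}'
)
```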
Transaction Attributes
| Attribute | Type | Set On | Description |
|---|---|---|---|
| `xrpl.tx.hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) |
| `xrpl.tx.local` | boolean | `tx.process` | true if locally submitted, false if peer-relayed |
| `xrpl.tx.path` | string | `tx.process` | Submission path: "sync" or "async" |
| `xrpl.tx.suppressed` | boolean | `tx.receive` | true if transaction was suppressed (duplicate) |
| `xrpl.tx.status` | string | `tx.receive` | Transaction status (e.g., "known_bad") |
Jaeger query: Tag xrpl.tx.hash=<hash> to trace a specific transaction across nodes.
Prometheus label: xrpl_tx_local (used as SpanMetrics dimension).
Consensus Attributes
| Attribute | Type | Set On | Description |
|---|---|---|---|
| `xrpl.consensus.round` | int64 | `consensus.proposal.send` | Consensus round number |
| `xrpl.consensus.mode` | string | `consensus.proposal.send`, `consensus.ledger_close` | Node mode: "syncing", "tracking", "full", "proposing" |
| `xrpl.consensus.proposers` | int64 | `consensus.proposal.send`, `consensus.accept` | Number of proposers in the round |
| `xrpl.consensus.proposing` | boolean | `consensus.validation.send` | Whether this node was a proposer |
| `xrpl.consensus.ledger.seq` | int64 | `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply` | Ledger sequence number |
| `xrpl.consensus.close_time` | int64 | `consensus.accept.apply` | Agreed-upon ledger close time (epoch seconds) |
| `xrpl.consensus.close_time_correct` | boolean | `consensus.accept.apply` | Whether validators reached agreement on close time |
| `xrpl.consensus.close_resolution_ms` | int64 | `consensus.accept.apply` | Close time rounding granularity in milliseconds |
| `xrpl.consensus.state` | string | `consensus.accept.apply` | Consensus outcome: "finished" or "moved_on" |
| `xrpl.consensus.round_time_ms` | int64 | `consensus.accept.apply` | Total consensus round duration in milliseconds |
Jaeger query: Tag xrpl.consensus.mode=proposing to find rounds where node was proposing.
Prometheus label: xrpl_consensus_mode (used as SpanMetrics dimension).
Ledger Attributes
| Attribute | Type | Set On | Description |
|---|---|---|---|
| `xrpl.ledger.seq` | int64 | `ledger.build`, `ledger.validate`, `ledger.store`, `tx.apply` | Ledger sequence number |
| `xrpl.ledger.validations` | int64 | `ledger.validate` | Number of validations received for this ledger |
| `xrpl.ledger.tx_count` | int64 | `ledger.build`, `tx.apply` | Transactions in the ledger |
| `xrpl.ledger.tx_failed` | int64 | `ledger.build`, `tx.apply` | Failed transactions in the ledger |
Jaeger query: Tag xrpl.ledger.seq=12345 to find all spans for a specific ledger.
Peer Attributes
| Attribute | Type | Set On | Description |
|---|---|---|---|
| `xrpl.peer.id` | int64 | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier |
| `xrpl.peer.proposal.trusted` | boolean | `peer.proposal.receive` | Whether the proposal came from a trusted validator |
| `xrpl.peer.validation.trusted` | boolean | `peer.validation.receive` | Whether the validation came from a trusted validator |
Prometheus labels: xrpl_peer_proposal_trusted, xrpl_peer_validation_trusted (SpanMetrics dimensions).
1.3 SpanMetrics — Derived Prometheus Metrics
See also: 01-architecture-analysis.md §1.8.2 for how span-derived metrics map to operational insights.
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.
| Prometheus Metric | Type | Description |
|---|---|---|
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
Standard labels on every metric: span_name, status_code, service_name, span_kind
Additional dimension labels (configured in otel-collector-config.yaml):
| Span Attribute | Prometheus Label | Applies To |
|---|---|---|
| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` |
| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` |
| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` |
| `xrpl.peer.proposal.trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` |
| `xrpl.peer.validation.trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` |
Where to query: Prometheus → traces_span_metrics_calls_total{span_name="rpc.command.server_info"}
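A hedged sketch of how these dimensions and histogram buckets are typically declared for the spanmetrics connector in otel-collector-config.yaml; the exact keys vary by collector version, so verify against the shipped config:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 5000ms]
    dimensions:
      - name: xrpl.rpc.command
      - name: xrpl.rpc.status
      - name: xrpl.consensus.mode
      - name: xrpl.tx.local
      - name: xrpl.peer.proposal.trusted
      - name: xrpl.peer.validation.trusted
```

Each listed dimension becomes an extra label on the derived metrics, with dots converted to underscores.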
2. System Metrics (beast::insight — OTel native)
See also: 02-design-decisions.md for the beast::insight coexistence design. 06-implementation-phases.md for the Phase 6/7 metric inventory.
Migration complete: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changed. `[insight] server=statsd` remains as a fallback.
These are system-level metrics emitted by rippled's beast::insight framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.
Configuration
# Recommended: native OTel metrics via OTLP/HTTP
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
Fallback (StatsD):
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
2.1 Gauges
| Prometheus Metric | Source File | Description | Typical Range |
|---|---|---|---|
| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) |
| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) |
| `rippled_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| `rippled_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| `rippled_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| `rippled_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| `rippled_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| `rippled_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| `rippled_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low |
| `rippled_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| `rippled_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| `rippled_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 0–85 |
| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `rippled_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |
Grafana dashboard: Node Health (System Metrics) (rippled-system-node-health)
2.2 Counters
| Prometheus Metric | Source File | Description |
|---|---|---|
| `rippled_rpc_requests` | ServerHandler.cpp | Total RPC requests received |
| `rippled_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts |
| `rippled_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected |
| `rippled_warn` | Logic.h | Resource manager warnings issued |
| `rippled_drop` | Logic.h | Resource manager drops (connections rejected) |
Note: With server=otel, rippled_warn and rippled_drop are properly exported as OTel Counter instruments. The previous StatsD |m type limitation no longer applies.
Grafana dashboard: RPC & Pathfinding (System Metrics) (rippled-system-rpc)
2.3 Histograms (Event timers)
| Prometheus Metric | Source File | Unit | Description |
|---|---|---|---|
| `rippled_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution |
| `rippled_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution |
| `rippled_ios_latency` | Application.cpp | ms | I/O service loop latency |
| `rippled_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration |
| `rippled_pathfind_full` | PathRequests.h | ms | Full pathfinding duration |
Quantiles collected: 0th, 50th, 90th, 95th, 99th, 100th percentile.
Grafana dashboards: Node Health (ios_latency), RPC & Pathfinding (rpc_time, rpc_size, pathfind_*)
2.4 Overlay Traffic Metrics
For each of the 45+ overlay traffic categories (defined in TrafficCount.h), four gauges are emitted:
- `rippled_{category}_Bytes_In`
- `rippled_{category}_Bytes_Out`
- `rippled_{category}_Messages_In`
- `rippled_{category}_Messages_Out`
Key categories:
| Category | Description |
|---|---|
| `total` | All traffic aggregated |
| `overhead` / `overhead_overlay` | Protocol overhead |
| `transactions` / `transactions_duplicate` | Transaction relay |
| `proposals` / `proposals_untrusted` / `proposals_duplicate` | Consensus proposals |
| `validations` / `validations_untrusted` / `validations_duplicate` | Consensus validations |
| `ledger_data_get` / `ledger_data_share` | Ledger data exchange |
| `ledger_data_Transaction_Node_get/share` | Transaction node data |
| `ledger_data_Account_State_Node_get/share` | Account state node data |
| `ledger_data_Transaction_Set_candidate_get/share` | Transaction set candidates |
| `getObject` / `haveTxSet` / `ledgerData` | Object requests |
| `ping` / `status` | Keepalive and status |
| `set_get` | Set requests |
Grafana dashboards: Network Traffic (rippled-system-network), Overlay Traffic Detail (rippled-system-overlay-detail), Ledger Data & Sync (rippled-system-ledger-sync)
3. Grafana Dashboard Reference
See also: 05-configuration-reference.md §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples.
3.1 Span-Derived Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| RPC Performance | `rippled-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | `rippled-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | `rippled-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | `rippled-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `rippled-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |
3.2 System Metrics Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Node Health | `rippled-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | `rippled-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | `rippled-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `rippled-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `rippled-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
3.3 Accessing the Dashboards
- Open Grafana at http://localhost:3000
- Navigate to Dashboards → rippled folder
- All 10 dashboards are auto-provisioned from `docker/telemetry/grafana/dashboards/`
4. Jaeger Trace Search Guide
See also: 08-appendix.md §8.2 for span hierarchy visualizations. 05-configuration-reference.md §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger.
Finding Traces by Type
| What to Find | Jaeger Search Parameters |
|---|---|
| All RPC calls | Service: rippled, Operation: rpc.request |
| Specific RPC command | Operation: rpc.command.server_info (or any command name) |
| Slow RPC calls | Operation: rpc.command.*, Min Duration: 100ms |
| Failed RPC calls | Tag: xrpl.rpc.status=error |
| Specific transaction | Tag: xrpl.tx.hash=<hex_hash> |
| Local transactions only | Tag: xrpl.tx.local=true |
| Consensus rounds | Operation: consensus.accept |
| Rounds by mode | Tag: xrpl.consensus.mode=proposing |
| Specific ledger | Tag: xrpl.ledger.seq=12345 |
| Peer proposals (trusted) | Tag: xrpl.peer.proposal.trusted=true |
Trace Structure
A typical RPC trace shows the span hierarchy:
rpc.request (ServerHandler)
└── rpc.process (ServerHandler)
└── rpc.command.server_info (RPCHandler)
A consensus round produces independent spans (not parent-child):
consensus.ledger_close (close event)
consensus.proposal.send (broadcast proposal)
ledger.build (build new ledger)
└── tx.apply (apply transaction set)
consensus.accept (accept result)
consensus.validation.send (send validation)
ledger.validate (promote to validated)
ledger.store (persist to DB)
5. Prometheus Query Examples
See also: 05-configuration-reference.md §5.8.7 for correlating Prometheus system metrics with trace-derived metrics.
Span-Derived Metrics
# RPC request rate by command (last 5 minutes)
sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))
# RPC p95 latency by command
histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))
# Consensus round duration p95
histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m])))
# Transaction processing rate (local vs relay)
sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))
# Trusted vs untrusted proposal rate
sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
System Metrics (beast::insight)
# Validated ledger age (should be < 10s)
rippled_LedgerMaster_Validated_Ledger_Age
# Active peer count
rippled_Peer_Finder_Active_Inbound_Peers + rippled_Peer_Finder_Active_Outbound_Peers
# RPC response time p95
histogram_quantile(0.95, rippled_rpc_time_bucket)
# Total network bytes in (rate)
rate(rippled_total_Bytes_In[5m])
# Cumulative time in Full state (should dominate after startup)
rippled_State_Accounting_Full_duration
5a. Log-Trace Correlation (Phase 8)
Plan details: 06-implementation-phases.md §6.8.1 (motivation, architecture, Mermaid diagrams) | Task breakdown: Phase8_taskList.md (per-task implementation details)
Phase 8 injects OTel trace context into rippled's Logs::format() output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:
Log Format
<timestamp> <partition>:<severity> trace_id=<32hex> span_id=<16hex> <message>
Example:
2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42
- `trace_id=<hex32>` — 32-character lowercase hex trace identifier. Links to the distributed trace in Tempo/Jaeger.
- `span_id=<hex16>` — 16-character lowercase hex span identifier. Identifies the specific span within the trace.
- Only present when the log is emitted within an active OTel span. Log lines outside of traced code paths have no trace context fields.
Implementation
The trace context injection is implemented in Logs::format() (src/libxrpl/basics/Log.cpp), guarded by #ifdef XRPL_ENABLE_TELEMETRY. It reads the current span from OTel's thread-local runtime context via opentelemetry::trace::GetSpan() and opentelemetry::context::RuntimeContext::GetCurrent(). Both calls are lock-free thread-local reads measured at <10ns per call.
Log Ingestion Pipeline
rippled debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki
The OTel Collector's filelog receiver tails debug.log files and uses a regex_parser operator to extract structured fields:
| Field | Type | Description |
|---|---|---|
| `timestamp` | datetime | Log timestamp |
| `partition` | string | Log partition (e.g., LedgerMaster, PeerImp) |
| `severity` | string | Severity code (TRC, DBG, NFO, WRN, ERR, FTL) |
| `trace_id` | string | 32-hex trace identifier (optional) |
| `span_id` | string | 16-hex span identifier (optional) |
| `message` | string | Log message body |
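The extraction above can be reproduced with a pattern like the following — a sketch approximating the collector's regex_parser, not the exact shipped expression:

```python
import re

# Approximates the Phase 8 log format:
# <timestamp> <partition>:<severity> [trace_id=... span_id=...] <message>
LOG_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<partition>\w+):(?P<severity>TRC|DBG|NFO|WRN|ERR|FTL)\s+"
    r"(?:trace_id=(?P<trace_id>[a-f0-9]{32})\s+span_id=(?P<span_id>[a-f0-9]{16})\s+)?"
    r"(?P<message>.*)$"
)

line = ("2024-01-15T10:30:45.123Z LedgerMaster:NFO "
        "trace_id=abc123def456789012345678abcdef01 "
        "span_id=0123456789abcdef Validated ledger 42")
m = LOG_RE.match(line)
assert m is not None
assert m["trace_id"] == "abc123def456789012345678abcdef01"
assert m["message"] == "Validated ledger 42"
```

The trace-context group is optional, so log lines emitted outside a span still parse with `trace_id` and `span_id` set to None.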
Grafana Correlation
Bidirectional linking between logs and traces is configured via Grafana datasource provisioning:
- Tempo -> Loki (`tracesToLogs`): Clicking "Logs for this trace" on a Tempo trace view filters Loki logs by `trace_id`, showing all log lines from that trace.
- Loki -> Tempo (`derivedFields`): A regex-based derived field on the Loki datasource extracts `trace_id` from log lines and renders it as a clickable link to the corresponding trace in Tempo.
Loki Backend
Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries from the OTel Collector's loki exporter via the push API at http://loki:3100/loki/api/v1/push.
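For ad-hoc testing, a line can be pushed to that same endpoint by hand. A stdlib-only sketch of the push-API payload shape (the helper name is mine; the POST itself requires a running Loki and is left out):

```python
import json
import time

def loki_push_payload(line: str, labels: dict) -> str:
    """Build a Loki push-API body: streams of [nanosecond-timestamp, line] pairs."""
    ts_ns = str(time.time_ns())  # Loki expects the timestamp as a string of ns
    return json.dumps({"streams": [{"stream": labels, "values": [[ts_ns, line]]}]})

body = loki_push_payload("test line trace_id=00", {"job": "rippled"})
# POST this body to http://loki:3100/loki/api/v1/push
# with Content-Type: application/json.
```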
LogQL Query Examples
# Find all logs for a specific trace
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"
# Error logs with trace context
{job="rippled"} |= "ERR" |= "trace_id="
# Logs from a specific partition with trace context
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
# Count traced log lines over time
count_over_time({job="rippled"} |= "trace_id=" [5m])
5b. Future: Internal Metric Gap Fill (Phase 9)
Status: Planned, not yet implemented. Plan details: 06-implementation-phases.md §6.8.2 (motivation, architecture, third-party context) | Task breakdown: Phase9_taskList.md (per-task implementation details)
Phase 9 adds time-series export for 50+ metrics that exist inside rippled but are not currently exported. It uses a hybrid approach: beast::insight extensions for NodeStore I/O, and OTel ObservableGauge async callbacks for the new categories.
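The ObservableGauge pattern means the SDK pulls current values through registered callbacks at export time, rather than the application pushing every update. A dependency-free sketch of that idea — this mimics the pattern, it is not the OTel API:

```python
from typing import Callable, Dict

class ObservableGaugeSketch:
    """Mimics OTel's async-gauge pattern: values are read via callbacks only
    when the exporter collects, so the cost between exports is near zero."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, Callable[[], float]] = {}

    def register(self, name: str, callback: Callable[[], float]) -> None:
        self._callbacks[name] = callback

    def collect(self) -> Dict[str, float]:
        # Invoked by the exporter on its interval; reads live state lazily.
        return {name: cb() for name, cb in self._callbacks.items()}

registry = ObservableGaugeSketch()
queue = ["tx1", "tx2", "tx3"]  # stands in for the live transaction queue
registry.register("rippled_txq_count", lambda: float(len(queue)))
assert registry.collect() == {"rippled_txq_count": 3.0}
```

Because the lambda closes over the live queue, each collection reflects the current depth without any instrumentation on the hot path.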
New Metric Categories
NodeStore I/O (via beast::insight)
| Prometheus Metric | Type | Description |
|---|---|---|
| `rippled_nodestore_reads_total` | Gauge | Cumulative read operations |
| `rippled_nodestore_reads_hit` | Gauge | Cache-served reads |
| `rippled_nodestore_writes` | Gauge | Cumulative write operations |
| `rippled_nodestore_written_bytes` | Gauge | Cumulative bytes written |
| `rippled_nodestore_read_bytes` | Gauge | Cumulative bytes read |
| `rippled_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
| `rippled_nodestore_write_load` | Gauge | Current write load score |
| `rippled_nodestore_read_queue` | Gauge | Items in read queue |
Cache Hit Rates (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
|---|---|---|
| `rippled_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0–1.0) |
| `rippled_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate |
| `rippled_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate |
| `rippled_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) |
| `rippled_cache_fullbelow_size` | Gauge | FullBelow cache size |
Transaction Queue (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
|---|---|---|
| `rippled_txq_count` | Gauge | Current transactions in queue |
| `rippled_txq_max_size` | Gauge | Maximum queue capacity |
| `rippled_txq_in_ledger` | Gauge | Transactions in open ledger |
| `rippled_txq_per_ledger` | Gauge | Expected transactions per ledger |
| `rippled_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level |
| `rippled_txq_med_fee_level` | Gauge | Median fee level in queue |
| `rippled_txq_reference_fee_level` | Gauge | Reference fee level |
| `rippled_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed |
PerfLog Per-RPC Method (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `rippled_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed |
| `rippled_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls errored |
| `rippled_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
PerfLog Per-Job Type (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_job_queued_total` | Counter | `job_type="<name>"` | Jobs queued |
| `rippled_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `rippled_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `rippled_job_queued_duration_us_bucket` | Histogram | `job_type="<name>"` | Queue wait time |
| `rippled_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time |
Counted Object Instances (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
Tracked types: Transaction, Ledger, NodeObject, STTx, STLedgerEntry, InboundLedger, Pathfinder, PathRequest, HashRouterEntry
Fee Escalation & Load Factors (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
|---|---|---|
| `rippled_load_factor` | Gauge | Combined transaction cost multiplier |
| `rippled_load_factor_server` | Gauge | Server + cluster + network load |
| `rippled_load_factor_local` | Gauge | Local server load only |
| `rippled_load_factor_net` | Gauge | Network-wide load estimate |
| `rippled_load_factor_cluster` | Gauge | Cluster peer load |
| `rippled_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
| `rippled_load_factor_fee_queue` | Gauge | Queue entry fee level |
New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth |
5c. Future: Synthetic Workload Generation & Telemetry Validation (Phase 10)
Plan details: 06-implementation-phases.md §6.8.3 (motivation, architecture) | Task breakdown: Phase10_taskList.md (per-task implementation details) | Tools: docker/telemetry/workload/ (RPC load generator, transaction submitter, validation suite, benchmarks)
Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify all spans, metrics, dashboards, and log-trace correlation work end-to-end. Includes a benchmark suite comparing telemetry-ON vs telemetry-OFF overhead.
### Running the Validation Suite

```bash
# Full end-to-end validation (start cluster, generate load, validate):
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld

# Validation only (assumes stack and cluster are already running):
python3 docker/telemetry/workload/validate_telemetry.py --report /tmp/report.json

# Performance benchmark (baseline vs telemetry):
docker/telemetry/workload/benchmark.sh --xrpld .build/xrpld --duration 300
```
### Validated Telemetry Inventory

| Category | Expected Count | Validation Method | Config File |
|---|---|---|---|
| Trace spans | 17 | Jaeger/Tempo API query | `expected_spans.json` |
| Span attributes | 22 | Per-span attribute assertion | `expected_spans.json` |
| StatsD metrics | 255+ | Prometheus query | `expected_metrics.json` |
| Phase 9 metrics | 50+ | Prometheus query | `expected_metrics.json` |
| SpanMetrics RED | 4 per span | Prometheus query | `expected_metrics.json` |
| Grafana dashboards | 10 | Dashboard API "no data" check | `expected_metrics.json` |
| Log-trace links | Present | Loki query + Tempo reverse check | — |
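As a sketch of how the span inventory check works, the snippet below diffs an expected span list (the shape stored in `expected_spans.json`) against the operation names Jaeger has recorded. The helper names (`missing_spans`, `fetch_operations`) and the example span names are illustrative, not the actual `validate_telemetry.py` code; the Jaeger query endpoint shown is the standard `/api/services/<service>/operations` route.

```python
"""Illustrative span-presence check against Jaeger's query API."""
import json
import urllib.request


def missing_spans(found_operations: list[str], expected: list[str]) -> list[str]:
    """Return the expected span names that Jaeger has never seen."""
    found = set(found_operations)
    return sorted(s for s in expected if s not in found)


def fetch_operations(jaeger_url: str, service: str) -> list[str]:
    # Jaeger query API: GET /api/services/<service>/operations -> {"data": [...]}
    with urllib.request.urlopen(f"{jaeger_url}/api/services/{service}/operations") as r:
        return json.load(r)["data"]


# Usage (assumes a running Jaeger at the default query port):
#   ops = fetch_operations("http://localhost:16686", "rippled")
#   print("missing:", missing_spans(ops, expected_span_names))
```

A validation run fails if `missing_spans` returns a non-empty list for any service.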
### Performance Overhead Targets

| Metric | Target | Measurement Method |
|---|---|---|
| CPU overhead | < 3% | `ps` avg CPU% baseline vs telemetry |
| Memory overhead | < 5MB | `ps` peak RSS baseline vs telemetry |
| RPC p99 latency | < 2ms impact | `server_info` round-trip timing |
| Throughput impact | < 5% | Ledger close rate comparison |
| Consensus impact | < 1% | Consensus round time p95 comparison |
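The pass/fail gate for the first three (absolute) targets can be expressed as a simple delta check, sketched below. The `overhead_report` helper and the sample numbers are illustrative, not taken from `benchmark.sh`; only the thresholds mirror the table above.

```python
"""Illustrative overhead gate: baseline vs telemetry-enabled samples."""


def overhead_report(baseline: dict, telemetry: dict) -> dict:
    """Return per-metric deltas and whether each stays under its target."""
    checks = {
        # metric: (delta, limit) — CPU in %, RSS in MB, latency in ms
        "cpu_pct": (telemetry["cpu_pct"] - baseline["cpu_pct"], 3.0),
        "rss_mb": (telemetry["rss_mb"] - baseline["rss_mb"], 5.0),
        "rpc_p99_ms": (telemetry["rpc_p99_ms"] - baseline["rpc_p99_ms"], 2.0),
    }
    return {m: {"delta": d, "ok": d < limit} for m, (d, limit) in checks.items()}


# Example with made-up samples: all three deltas stay under target.
report = overhead_report(
    {"cpu_pct": 41.0, "rss_mb": 512.0, "rpc_p99_ms": 1.8},
    {"cpu_pct": 43.1, "rss_mb": 515.5, "rpc_p99_ms": 2.4},
)
```

The relative targets (throughput, consensus) would instead divide the delta by the baseline value before comparing against 5% and 1%.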
## 5d. Future: Third-Party Data Collection Pipelines (Phase 11)

Status: Planned, not yet implemented.

- Plan details: 06-implementation-phases.md §6.8.4 — motivation, architecture, consumer gap analysis
- Task breakdown: Phase11_taskList.md — per-task implementation details

Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports `xrpl_*` metrics for external consumers. No rippled code changes are required.
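To make the RPC-to-gauge mapping concrete, here is a sketch (in Python, although the planned receiver is Go) of extracting node-health gauges from a `server_info` response. The field names follow rippled's public `server_info` schema; the `extract_node_health` helper itself is hypothetical.

```python
"""Illustrative mapping from a server_info result to xrpl_* gauge values."""


def extract_node_health(server_info_result: dict) -> dict:
    info = server_info_result["info"]
    vl = info.get("validated_ledger", {})
    return {
        "xrpl_uptime_seconds": info["uptime"],
        "xrpl_io_latency_ms": info["io_latency_ms"],
        "xrpl_peers_count": info["peers"],
        "xrpl_load_factor": info["load_factor"],
        "xrpl_validated_ledger_seq": vl.get("seq", 0),
        "xrpl_validated_ledger_age_seconds": vl.get("age", 0),
        # server_state is a string ("proposing", "full", ...); the receiver
        # would translate it to the numeric xrpl_server_state encoding.
        "server_state": info["server_state"],
    }
```

The real receiver would poll this on an interval and hand the values to the OTel Metrics SDK as observable gauges.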
### Exported Metrics (via Custom OTel Collector Receiver)

#### Node Health (from `server_info`)

| Prometheus Metric | Type | Description |
|---|---|---|
| `xrpl_server_state` | Gauge | Operating mode (0=disconnected ... 5=proposing) |
| `xrpl_server_state_duration_seconds` | Gauge | Seconds in current state |
| `xrpl_uptime_seconds` | Gauge | Consecutive seconds running |
| `xrpl_io_latency_ms` | Gauge | I/O subsystem latency |
| `xrpl_amendment_blocked` | Gauge | 1 if amendment-blocked, 0 otherwise |
| `xrpl_peers_count` | Gauge | Connected peers |
| `xrpl_validated_ledger_seq` | Gauge | Latest validated ledger sequence |
| `xrpl_validated_ledger_age_seconds` | Gauge | Seconds since last validated close |
| `xrpl_last_close_proposers` | Gauge | Proposers in last consensus round |
| `xrpl_last_close_converge_time_seconds` | Gauge | Last consensus round duration |
| `xrpl_load_factor` | Gauge | Transaction cost multiplier |
| `xrpl_state_duration_seconds` | Gauge | Per-state duration (`state` label) |
| `xrpl_state_transitions_total` | Gauge | Per-state transition count (`state` label) |
#### Peer Topology (from `peers`)

| Prometheus Metric | Type | Description |
|---|---|---|
| `xrpl_peers_inbound_count` | Gauge | Inbound peer connections |
| `xrpl_peers_outbound_count` | Gauge | Outbound peer connections |
| `xrpl_peer_latency_p50_ms` | Gauge | Median peer latency |
| `xrpl_peer_latency_p95_ms` | Gauge | p95 peer latency |
| `xrpl_peer_version_count` | Gauge | Peers per version (`version` label) |
| `xrpl_peer_diverged_count` | Gauge | Peers with diverged tracking status |
#### Validator & Amendment (from `validators`, `feature`)

| Prometheus Metric | Type | Description |
|---|---|---|
| `xrpl_trusted_validators_count` | Gauge | UNL validator count |
| `xrpl_amendment_enabled_count` | Gauge | Enabled amendments |
| `xrpl_amendment_majority_count` | Gauge | Amendments with majority |
| `xrpl_amendment_unsupported_majority` | Gauge | 1 if an unsupported amendment has majority |
| `xrpl_validator_list_active` | Gauge | 1 if validator list is active |
#### Fee Market (from `fee`)

| Prometheus Metric | Type | Description |
|---|---|---|
| `xrpl_fee_open_ledger_fee_drops` | Gauge | Minimum fee for open ledger inclusion |
| `xrpl_fee_median_fee_drops` | Gauge | Median fee level |
| `xrpl_fee_queue_size` | Gauge | Current transaction queue depth |
| `xrpl_fee_current_ledger_size` | Gauge | Transactions in current open ledger |
#### DEX & AMM (optional, from `book_offers`, `amm_info`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `xrpl_amm_tvl_drops` | Gauge | `pool="<id>"` | Total value locked |
| `xrpl_amm_trading_fee` | Gauge | `pool="<id>"` | Pool trading fee (bps) |
| `xrpl_orderbook_bid_depth` | Gauge | `pair="<base/quote>"` | Total bid volume |
| `xrpl_orderbook_ask_depth` | Gauge | `pair="<base/quote>"` | Total ask volume |
| `xrpl_orderbook_spread` | Gauge | `pair="<base/quote>"` | Best bid-ask spread |
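For illustration, the three `xrpl_orderbook_*` gauges could be derived from two `book_offers` snapshots (one per side) as sketched below. Quotes are simplified to `(price, amount)` tuples; the `orderbook_gauges` helper is hypothetical, and the real receiver would first parse the `book_offers` JSON into this shape.

```python
"""Illustrative derivation of the order book gauges from quote lists."""


def orderbook_gauges(bids: list[tuple[float, float]],
                     asks: list[tuple[float, float]]) -> dict:
    best_bid = max(p for p, _ in bids)   # highest price a buyer will pay
    best_ask = min(p for p, _ in asks)   # lowest price a seller will accept
    return {
        "xrpl_orderbook_bid_depth": sum(a for _, a in bids),
        "xrpl_orderbook_ask_depth": sum(a for _, a in asks),
        "xrpl_orderbook_spread": best_ask - best_bid,
    }


# Example with made-up quotes for one pair:
g = orderbook_gauges(bids=[(0.99, 500.0), (0.98, 1200.0)],
                     asks=[(1.01, 800.0), (1.03, 300.0)])
```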
## Phase 9: OTel SDK-Exported Metrics (MetricsRegistry)

Phase 9 introduces the MetricsRegistry class (`src/xrpld/telemetry/MetricsRegistry.h/.cpp`), which registers metrics directly with the OpenTelemetry Metrics SDK. These are exported via OTLP/HTTP to the OTel Collector and scraped by Prometheus.
#### NodeStore I/O (Observable Gauge — `nodestore_state`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_nodestore_state{metric="node_reads_total"}` | Gauge | `metric` | Cumulative NodeStore read operations |
| `rippled_nodestore_state{metric="node_reads_hit"}` | Gauge | `metric` | Reads served from cache |
| `rippled_nodestore_state{metric="node_writes"}` | Gauge | `metric` | Cumulative write operations |
| `rippled_nodestore_state{metric="node_written_bytes"}` | Gauge | `metric` | Cumulative bytes written |
| `rippled_nodestore_state{metric="node_read_bytes"}` | Gauge | `metric` | Cumulative bytes read |
| `rippled_nodestore_state{metric="write_load"}` | Gauge | `metric` | Current write load score |
| `rippled_nodestore_state{metric="read_queue"}` | Gauge | `metric` | Items in read prefetch queue |
#### Cache Hit Rates & Sizes (Observable Gauge — `cache_metrics`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_cache_metrics{metric="SLE_hit_rate"}` | Gauge | `metric` | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_metrics{metric="ledger_hit_rate"}` | Gauge | `metric` | Ledger cache hit rate |
| `rippled_cache_metrics{metric="AL_hit_rate"}` | Gauge | `metric` | AcceptedLedger cache hit rate |
| `rippled_cache_metrics{metric="treenode_cache_size"}` | Gauge | `metric` | SHAMap TreeNode cache entries |
| `rippled_cache_metrics{metric="treenode_track_size"}` | Gauge | `metric` | Tracked tree nodes |
| `rippled_cache_metrics{metric="fullbelow_size"}` | Gauge | `metric` | FullBelow cache entries |
#### Transaction Queue (Observable Gauge — `txq_metrics`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_txq_metrics{metric="txq_count"}` | Gauge | `metric` | Transactions currently in queue |
| `rippled_txq_metrics{metric="txq_max_size"}` | Gauge | `metric` | Maximum queue capacity |
| `rippled_txq_metrics{metric="txq_in_ledger"}` | Gauge | `metric` | Transactions in open ledger |
| `rippled_txq_metrics{metric="txq_per_ledger"}` | Gauge | `metric` | Expected transactions per ledger |
| `rippled_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | `metric` | Reference fee level |
| `rippled_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | `metric` | Minimum fee to get processed |
| `rippled_txq_metrics{metric="txq_med_fee_level"}` | Gauge | `metric` | Median fee level in queue |
| `rippled_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | `metric` | Open ledger fee escalation level |
#### Per-RPC Method Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `rippled_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed successfully |
| `rippled_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls that errored |
| `rippled_rpc_method_duration_us` | Histogram | `method="<name>"` | Execution time distribution (us) |
#### Per-Job-Type Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_job_queued_total` | Counter | `job_type="<name>"` | Jobs enqueued |
| `rippled_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `rippled_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `rippled_job_queued_duration_us` | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| `rippled_job_running_duration_us` | Histogram | `job_type="<name>"` | Execution time distribution (us) |
#### Counted Object Instances (Observable Gauge — `object_count`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_object_count{type="Transaction"}` | Gauge | `type="<name>"` | Live Transaction objects |
| `rippled_object_count{type="Ledger"}` | Gauge | `type="<name>"` | Live Ledger objects |
| `rippled_object_count{type="NodeObject"}` | Gauge | `type="<name>"` | Live NodeObject instances |
| `rippled_object_count{type="STTx"}` | Gauge | `type="<name>"` | Serialized transaction objects |
| `rippled_object_count{type="STLedgerEntry"}` | Gauge | `type="<name>"` | Serialized ledger entries |
| `rippled_object_count{type="InboundLedger"}` | Gauge | `type="<name>"` | Ledgers being fetched |
| `rippled_object_count{type="Pathfinder"}` | Gauge | `type="<name>"` | Active pathfinding operations |
| `rippled_object_count{type="PathRequest"}` | Gauge | `type="<name>"` | Active path requests |
| `rippled_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries |
#### Load Factor Breakdown (Observable Gauge — `load_factor_metrics`)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_load_factor_metrics{metric="load_factor"}` | Gauge | `metric` | Combined transaction cost multiplier |
| `rippled_load_factor_metrics{metric="load_factor_server"}` | Gauge | `metric` | Server + cluster + network contribution |
| `rippled_load_factor_metrics{metric="load_factor_local"}` | Gauge | `metric` | Local server load only |
| `rippled_load_factor_metrics{metric="load_factor_net"}` | Gauge | `metric` | Network-wide load estimate |
| `rippled_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | `metric` | Cluster peer load |
| `rippled_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | `metric` | Open ledger fee escalation |
| `rippled_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | `metric` | Queue entry fee level |
### Prometheus Query Examples (Phase 9)

```promql
# NodeStore cache hit ratio
rippled_nodestore_state{metric="node_reads_hit"} / rippled_nodestore_state{metric="node_reads_total"}

# RPC error rate for server_info
rate(rippled_rpc_method_errored_total{method="server_info"}[5m])

# Job queue wait time p95
histogram_quantile(0.95, sum by (le) (rate(rippled_job_queued_duration_us_bucket[5m])))

# TxQ utilization (ratio of queue depth to capacity, 0-1)
rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"}

# High load factor alert candidate
rippled_load_factor_metrics{metric="load_factor"} > 5
```
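Queries like these can also be run programmatically, which is how the validation suite's metric checks work conceptually. The sketch below uses the standard Prometheus HTTP API (`/api/v1/query`); the `has_series` helper is hypothetical, not the actual `validate_telemetry.py` code.

```python
"""Illustrative metric-presence check via the Prometheus HTTP API."""
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url: str, promql: str) -> dict:
    """Run an instant query; returns the standard Prometheus response envelope."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as r:
        return json.load(r)


def has_series(response: dict) -> bool:
    """True if the query returned at least one sample (metric is being scraped)."""
    return response.get("status") == "success" and len(response["data"]["result"]) > 0


# Usage (assumes Prometheus on its default port):
#   resp = query_prometheus("http://localhost:9090",
#                           'rippled_txq_metrics{metric="txq_count"}')
#   assert has_series(resp), "TxQ metrics missing from Prometheus"
```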
### New Grafana Dashboards (Phase 9)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | `rippled-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions |
### Updated Grafana Dashboards (Phase 9)

| Dashboard | UID | New Panels Added |
|---|---|---|
| Node Health (StatsD) | `rippled-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts |
### New Grafana Dashboards (Phase 11)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Validator Health | `rippled-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | `rippled-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | `rippled-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | `rippled-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |
### Prometheus Alerting Rules (Phase 11)

| Alert Name | Severity | Condition | For |
|---|---|---|---|
| `XRPLServerNotFull` | Critical | `xrpl_server_state < 4` | 15m |
| `XRPLAmendmentBlocked` | Critical | `xrpl_amendment_blocked == 1` | 1m |
| `XRPLNoPeers` | Critical | `xrpl_peers_count == 0` | 5m |
| `XRPLLedgerStale` | Critical | `xrpl_validated_ledger_age_seconds > 120` | 2m |
| `XRPLHighIOLatency` | Critical | `xrpl_io_latency_ms > 100` | 5m |
| `XRPLUnsupportedAmendmentMajority` | Critical | `xrpl_amendment_unsupported_majority == 1` | 1m |
| `XRPLLowPeerCount` | Warning | `xrpl_peers_count < 10` | 15m |
| `XRPLHighLoadFactor` | Warning | `xrpl_load_factor > 10` | 10m |
| `XRPLSlowConsensus` | Warning | `xrpl_last_close_converge_time_seconds > 6` | 5m |
| `XRPLValidatorListExpiring` | Warning | `(xrpl_validator_list_expiration_seconds - time()) < 86400` | 1h |
| `XRPLStateFlapping` | Warning | `rate(xrpl_state_transitions_total{state="full"}[1h]) > 2` | 30m |
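As a sketch, one row of this table translates to a standard Prometheus rule-file entry like the following. The group name and annotation text are illustrative; only the alert name, expression, duration, and severity come from the table above.

```yaml
groups:
  - name: xrpl-node-health   # illustrative group name
    rules:
      - alert: XRPLLedgerStale
        expr: xrpl_validated_ledger_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No validated ledger for over 2 minutes on {{ $labels.instance }}"
```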
## 6. Known Issues

| Issue | Impact | Status |
|---|---|---|
| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m` → `\|c` change in StatsDCollector.cpp |
| `rippled_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `rippled_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight]` `server=statsd` in xrpld.cfg |
| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet |
## 7. Privacy and Data Collection

The telemetry system is designed with privacy in mind:

- No private keys are ever included in spans or metrics
- No account balances or financial data is traced
- Transaction hashes are included (public on-ledger data) but not transaction contents
- Peer IDs are internal identifiers, not IP addresses
- All telemetry is opt-in — disabled by default at build time (`-Dtelemetry=OFF`)
- Sampling reduces data volume — `sampling_ratio=0.01` recommended for production
- Data stays local — the default stack sends data to `localhost` only
## 8. Configuration Quick Reference

Full reference: 05-configuration-reference.md §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production).
### Minimal Setup (development)

```ini
[telemetry]
enabled=1

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```

### Production Setup

```ini
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096

[insight]
server=statsd
address=otel-collector:8125
prefix=rippled
```
### Trace Category Toggles

| Config Key | Default | Controls |
|---|---|---|
| `trace_rpc` | `1` | `rpc.*` spans |
| `trace_transactions` | `1` | `tx.*` spans |
| `trace_consensus` | `1` | `consensus.*` spans |
| `trace_ledger` | `1` | `ledger.*` spans |
| `trace_peer` | `0` | `peer.*` spans (high volume) |
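These keys live in the same `[telemetry]` stanza as the setup examples above. For instance, a test-network config that also enables the high-volume `peer.*` spans might look like this (values are illustrative, not recommended defaults):

```ini
[telemetry]
enabled=1
trace_rpc=1
trace_consensus=1
# Safe on a small test network; leave at 0 on mainnet due to span volume.
trace_peer=1
```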