Implementation Phases
Parent Document: OpenTelemetryPlan.md Related: Configuration Reference | Observability Backends
6.1 Phase Overview
TxQ = Transaction Queue
gantt
title OpenTelemetry Implementation Timeline
dateFormat YYYY-MM-DD
axisFormat Week %W
section Phase 1
Core Infrastructure :p1, 2024-01-01, 2w
SDK Integration :p1a, 2024-01-01, 4d
Telemetry Interface :p1b, after p1a, 3d
Configuration & CMake :p1c, after p1b, 3d
Unit Tests :p1d, after p1c, 2d
Buffer & Integration :p1e, after p1d, 2d
section Phase 2
RPC Tracing :p2, after p1, 2w
HTTP Context Extraction :p2a, after p1, 2d
RPC Handler Instrumentation :p2b, after p2a, 4d
PathFinding Instrumentation :p2f, after p2b, 2d
TxQ Instrumentation :p2g, after p2f, 2d
WebSocket Support :p2c, after p2g, 2d
Integration Tests :p2d, after p2c, 2d
Buffer & Review :p2e, after p2d, 4d
section Phase 3
Transaction Tracing :p3, after p2, 2w
Protocol Buffer Extension :p3a, after p2, 2d
PeerImp Instrumentation :p3b, after p3a, 3d
Fee Escalation Instrumentation :p3f, after p3b, 2d
Relay Context Propagation :p3c, after p3f, 3d
Multi-node Tests :p3d, after p3c, 2d
Buffer & Review :p3e, after p3d, 4d
section Phase 4
Consensus Tracing :p4, after p3, 2w
Consensus Round Spans :p4a, after p3, 3d
Proposal Handling :p4b, after p4a, 3d
Establish Phase (4a) :p4f, after p4b, 3d
Validation Tests :p4c, after p4f, 4d
Buffer & Review :p4e, after p4c, 4d
section Phase 5
Documentation & Deploy :p5, after p4, 1w
section Phase 6
StatsD Metrics Bridge :p6, after p5, 1w
section Phase 7
Native OTel Metrics :p7, after p6, 2w
section Phase 8
Log-Trace Correlation :p8, after p7, 1w
section Phase 9 (Future)
Internal Metric Gap Fill :p9, after p8, 2.5w
section Phase 10 (Future)
Workload Validation :p10, after p9, 2w
section Phase 11 (Future)
Third-Party Collection :p11, after p10, 3w
6.2 Phase 1: Core Infrastructure (Weeks 1-2)
Objective: Establish foundational telemetry infrastructure
Tasks
| Task | Description |
|---|---|
| 1.1 | Add OpenTelemetry C++ SDK to Conan/CMake |
| 1.2 | Implement Telemetry interface and factory |
| 1.3 | Implement SpanGuard RAII wrapper |
| 1.4 | Implement configuration parser |
| 1.5 | Integrate into ApplicationImp |
| 1.6 | Add conditional compilation (XRPL_ENABLE_TELEMETRY) |
| 1.7 | Create NullTelemetry no-op implementation |
| 1.8 | Unit tests for core infrastructure |
Exit Criteria
- OpenTelemetry SDK compiles and links
- Telemetry can be enabled/disabled via config
- Basic span creation works
- No performance regression when disabled
- Unit tests passing
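Tasks 1.2, 1.3, and 1.7 fit together roughly as sketched below. This is a minimal illustration, not the actual xrpld API: the `Span`, `Telemetry`, and `RecordingTelemetry` types, their method names, and the constructor signature of `SpanGuard` are all assumptions made for the example. The point is that the guard ends its span exactly once when it leaves scope, and that the no-op implementation hands back no span at all.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal span interface (the real one is not shown in this plan).
struct Span {
    virtual ~Span() = default;
    virtual void end() = 0;
};

// Hypothetical telemetry interface acting as a span factory.
struct Telemetry {
    virtual ~Telemetry() = default;
    virtual std::unique_ptr<Span> startSpan(std::string name) = 0;
};

// No-op implementation: never creates a span.
struct NullTelemetry final : Telemetry {
    std::unique_ptr<Span> startSpan(std::string) override { return nullptr; }
};

// Test double that records the names of ended spans.
struct RecordingTelemetry final : Telemetry {
    std::vector<std::string> ended;

    struct Rec final : Span {
        RecordingTelemetry& owner;
        std::string name;
        Rec(RecordingTelemetry& o, std::string n) : owner(o), name(std::move(n)) {}
        void end() override { owner.ended.push_back(name); }
    };

    std::unique_ptr<Span> startSpan(std::string name) override {
        return std::make_unique<Rec>(*this, std::move(name));
    }
};

// RAII wrapper: ends the span when the guard is destroyed.
class SpanGuard {
    std::unique_ptr<Span> span_;

public:
    SpanGuard(Telemetry& t, std::string name) : span_(t.startSpan(std::move(name))) {}
    SpanGuard(SpanGuard&&) noexcept = default;  // moved-from guard holds no span
    SpanGuard(const SpanGuard&) = delete;
    SpanGuard& operator=(const SpanGuard&) = delete;
    ~SpanGuard() {
        if (span_) span_->end();
    }
    bool active() const { return span_ != nullptr; }
};
```

With this shape, a call site only declares `SpanGuard g(telemetry, "some.span");` at the top of a scope; early returns and exceptions still end the span.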
6.3 Phase 2: RPC Tracing (Weeks 3-4)
TxQ = Transaction Queue
Objective: Complete tracing for all RPC operations
Tasks
| Task | Description |
|---|---|
| 2.1 | Implement W3C Trace Context HTTP header extraction |
| 2.2 | Instrument ServerHandler::onRequest() |
| 2.3 | Instrument RPCHandler::doCommand() |
| 2.4 | Add RPC-specific attributes |
| 2.5 | Instrument WebSocket handler |
| 2.6 | PathFinding instrumentation (pathfind.request, pathfind.compute spans) |
| 2.7 | TxQ instrumentation (txq.enqueue, txq.apply spans) |
| 2.8 | Integration tests for RPC tracing |
| 2.9 | Performance benchmarks |
| 2.10 | Documentation |
Exit Criteria
- All RPC commands traced
- Trace context propagates from HTTP headers
- WebSocket and HTTP both instrumented
- <1ms overhead per RPC call
- Integration tests passing
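For Task 2.1, the W3C Trace Context `traceparent` header has the fixed layout `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below shows one way to validate and split it using only the standard library. The `TraceParent` struct and the `parseTraceparent` function are hypothetical names for this example; the sketch only accepts version `00` and does not handle `tracestate`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

struct TraceParent {
    std::string traceId;  // 32 lowercase hex characters
    std::string spanId;   // 16 lowercase hex characters
    bool sampled;         // bit 0 of trace-flags
};

namespace detail {
inline bool isLowerHex(std::string_view s) {
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) return false;
    return !s.empty();
}
inline bool allZero(std::string_view s) {
    return s.find_first_not_of('0') == std::string_view::npos;
}
inline int hexVal(char c) { return c <= '9' ? c - '0' : c - 'a' + 10; }
}  // namespace detail

// Parses a version-00 traceparent value; returns nullopt on any violation.
inline std::optional<TraceParent> parseTraceparent(std::string_view h) {
    // 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters, dashes at fixed offsets.
    if (h.size() != 55 || h[2] != '-' || h[35] != '-' || h[52] != '-')
        return std::nullopt;
    const auto version = h.substr(0, 2);
    const auto trace = h.substr(3, 32);
    const auto span = h.substr(36, 16);
    const auto flags = h.substr(53, 2);
    if (version != "00") return std::nullopt;
    if (!detail::isLowerHex(trace) || !detail::isLowerHex(span) ||
        !detail::isLowerHex(flags))
        return std::nullopt;
    // All-zero IDs are invalid per the specification.
    if (detail::allZero(trace) || detail::allZero(span)) return std::nullopt;
    const bool sampled = (detail::hexVal(flags[1]) & 1) != 0;
    return TraceParent{std::string(trace), std::string(span), sampled};
}
```

Rejecting malformed headers outright (rather than partially using them) means a bad client header simply starts a fresh trace instead of corrupting an existing one.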
6.4 Phase 3: Transaction Tracing (Weeks 5-6)
Objective: Trace transaction lifecycle across network with deterministic cross-node correlation
Tasks
| Task | Description |
|---|---|
| 3.1 | Define TraceContext Protocol Buffer message |
| 3.2 | Implement protobuf context serialization |
| 3.3 | Instrument PeerImp::handleTransaction() |
| 3.4 | Instrument NetworkOPs::submitTransaction() |
| 3.5 | Instrument HashRouter integration |
| 3.6 | Fee escalation instrumentation (fee.escalate span) |
| 3.7 | Implement relay context propagation |
| 3.8 | Integration tests (multi-node) |
| 3.9 | Deterministic transaction trace ID (trace_id = txHash[0:16]) |
| 3.10 | Performance benchmarks |
Deterministic Trace ID (Task 3.9)
Transaction spans use deterministic trace IDs derived from the transaction hash:
trace_id = txHash[0:16]. All nodes handling the same transaction independently
produce spans under the same trace_id. Protobuf span_id propagation (Task 3.7)
additionally provides parent-child relay ordering when available. See
02-design-decisions.md §2.5.0 for the design rationale
and Phase3_taskList.md Task 3.9 for the full implementation spec.
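The rule `trace_id = txHash[0:16]` can be illustrated as follows. This is a sketch under the assumption that the transaction hash is available as 32 raw bytes; the `TxHash` and `TraceId` aliases and both function names are hypothetical, and the real implementation is specified in Phase3_taskList.md.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

using TxHash = std::array<std::uint8_t, 32>;   // 256-bit transaction hash
using TraceId = std::array<std::uint8_t, 16>;  // 128-bit trace ID

// Takes the first 16 bytes of the hash as the trace ID.
inline TraceId traceIdFromTxHash(const TxHash& txHash) {
    TraceId id{};
    std::copy_n(txHash.begin(), id.size(), id.begin());
    return id;
}

// Lowercase hex encoding, the form trace IDs usually take in backends.
inline std::string toHex(const TraceId& id) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(id.size() * 2);
    for (std::uint8_t b : id) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}
```

Because the function depends only on the hash, every node computes the same value without exchanging any context. Only the first half of the hash is used, so two hashes that differ only in their last 16 bytes would map to the same trace ID.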
Exit Criteria
- Transaction traces span across nodes
- Trace context in Protocol Buffer messages
- HashRouter deduplication visible in traces
- Multi-node integration tests passing
- <5% overhead on transaction throughput
- Deterministic trace_id: all nodes produce same trace_id for same transaction
- Protobuf span_id propagation preserves parent-child ordering when available
6.5 Phase 4: Consensus Tracing (Weeks 7-8)
Objective: Full observability into consensus rounds
Tasks
| Task | Description | Status |
|---|---|---|
| 4.1 | Instrument RCLConsensusAdaptor::startRound() | ✅ Done (via 4a.2) |
| 4.2 | Instrument phase transitions | ✅ Done |
| 4.3 | Instrument proposal handling | ✅ Done |
| 4.4 | Instrument validation handling | ✅ Done |
| 4.5 | Add consensus-specific attributes | ✅ Done |
| 4.6 | Correlate with transaction traces | ✅ Done |
| 4.7 | Build verification and testing | ✅ Done |
| 4.8 | Validation span enrichment (ext. dashboard) | ❌ Not done |
Note: The original plan doc listed tasks 4.7-4.11 as "Validator list tracing", "Amendment voting tracing", "SHAMap sync tracing", "Multi-validator integration tests", and "Performance validation". These were descoped and replaced by the tasklist's 4.7 (build verification) and 4.8 (validation span enrichment). Validator, amendment, and SHAMap tracing are not implemented.
Spans Produced
| Span Name | Location | Attributes |
|---|---|---|
| consensus.phase.open | Consensus.h:707 | (none) |
| consensus.proposal.send | RCLConsensus.cpp:232 | xrpl.consensus.round |
| consensus.ledger_close | RCLConsensus.cpp:341 | xrpl.consensus.ledger.seq, xrpl.consensus.mode |
| consensus.accept | RCLConsensus.cpp:492 | xrpl.consensus.proposers, xrpl.consensus.round_time_ms, xrpl.consensus.quorum |
| consensus.accept.apply | RCLConsensus.cpp:541 | xrpl.consensus.close_time, close_time_correct, close_resolution_ms, state, proposing, round_time_ms, ledger.seq, parent_close_time, close_time_self, close_time_vote_bins, resolution_direction |
| consensus.validation.send | RCLConsensus.cpp:900 | xrpl.consensus.ledger.seq, xrpl.consensus.proposing |
Exit Criteria
- Complete consensus round traces
- Phase transitions visible (open, establish, close, accept)
- Proposals and validations traced — send and receive; relay deferred to Phase 4b
- Close time agreement tracked (per avCT_CONSENSUS_PCT)
- No impact on consensus timing
- Multi-validator test network validated
- Transaction-consensus correlation (Task 4.6): tx.included events in doAccept
- Validation span enrichment (Task 4.8): not implemented
Implementation Status — Phase 4a Complete
Phase 4a (establish-phase gap fill & cross-node correlation) adds:
- Deterministic trace ID derived from previousLedger.id() so all validators in the same round share the same trace_id (switchable via consensus_trace_strategy config: "deterministic" or "attribute"). See Configuration Reference for full configuration options. The consensus_trace_strategy option will be documented in the configuration reference as part of Phase 4a implementation.
- Round lifecycle spans: consensus.round with round-to-round span links.
- Establish phase: consensus.establish, consensus.update_positions (with dispute.resolve events), consensus.check (with threshold tracking).
- Mode changes: consensus.mode_change spans.
- Validation: consensus.validation.send with span link to round span (thread-safe cross-thread access via roundSpanContext_ snapshot).
- Separation of concerns: telemetry extracted to private helpers (startRoundTracing, createValidationSpan, startEstablishTracing, updateEstablishTracing, endEstablishTracing).
See Phase4_taskList.md for the full spec and implementation notes.
6.5a Phase 4a: Establish-Phase Gap Fill & Cross-Node Correlation
Objective: Fill tracing gaps in the establish phase and establish cross-node
correlation using deterministic trace IDs derived from previousLedger.id().
Approach: Direct instrumentation in Consensus.h and RCLConsensus.cpp.
All spans use SpanGuard factory methods (span(), hashSpan(), linkedSpan())
with TraceCategory::Consensus gating. No macros used — all tracing via direct
SpanGuard API calls.
Tasks
| Task | Description | Effort | Risk | Status |
|---|---|---|---|---|
| 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | 1d | Medium | ✅ Done (no macros) |
| 4a.1 | Adaptor getTelemetry() method | 0.5d | Low | ⏭️ Skipped (not needed) |
| 4a.2 | Switchable round span with deterministic traceID | 2d | High | ✅ Done |
| 4a.3 | Span members in Consensus.h | 0.5d | Medium | ✅ Done (with deviation) |
| 4a.4 | Instrument phaseEstablish() | 1d | Medium | ✅ Done |
| 4a.5 | Instrument updateOurPositions() | 1d | Medium | ✅ Done |
| 4a.6 | Instrument haveConsensus() (thresholds) | 1d | Medium | ✅ Done |
| 4a.7 | Instrument mode changes | 0.5d | Low | ✅ Done |
| 4a.8 | Reparent existing spans under round | 0.5d | Low | ✅ Done |
| 4a.9 | Build verification and testing | 1d | Low | ✅ Done |
Total Effort: 9 days
Spans Produced
| Span Name | Location | Key Attributes (actually set) |
|---|---|---|
| consensus.round | RCLConsensus.cpp | round_id, ledger_id, ledger.seq, mode, trace_strategy |
| consensus.establish | Consensus.h | converge_percent, establish_count, proposers |
| consensus.update_positions | Consensus.h | converge_percent, proposers, have_close_time_consensus, close_time_threshold, disputes_count, avalanche_threshold |
| consensus.check | Consensus.h | agree/disagree_count, converge_percent, have_close_time_consensus, threshold_percent, result |
| consensus.mode_change | RCLConsensus.cpp | mode.old, mode.new |
Exit Criteria
- Establish phase internals traced (establish, update_positions, check spans)
- Establish phase fully traced: disputes_count, avalanche_threshold, dispute yays/nays all implemented
- Cross-node correlation works via deterministic trace_id
- Strategy switchable via config (deterministic / attribute)
- Consecutive rounds linked via follows-from spans
- Build passes with telemetry ON and OFF
- No impact on consensus timing
See Phase4_taskList.md for full task details.
6.5b Phase 4b: Cross-Node Propagation (Future)
Objective: Wire TraceContextPropagator for P2P messages (proposals,
validations) to enable true distributed tracing between nodes.
Status: Design documented, NOT implemented. Protobuf fields (field 1001)
and TraceContextPropagator free functions exist. Wiring deferred until Phase 4a is
validated in a multi-node environment.
Prerequisites: Phase 4a complete and validated.
See Phase4_taskList.md § Phase 4b for full design.
6.6 Phase 5: Documentation & Deployment (Week 9)
Objective: Production readiness
Tasks
| Task | Description | Status |
|---|---|---|
| 5.1 | Operator runbook | Complete |
| 5.2 | Grafana dashboards | Complete |
| 5.3 | Alert definitions | Deferred — post-MVP |
| 5.4 | Collector deployment examples | Complete |
| 5.5 | Developer documentation | Complete |
| 5.6 | Training materials | Deferred — post-MVP |
| 5.7 | Final integration testing | Complete |
6.7 Phase 6: StatsD Metrics Integration (Week 10)
Objective: Bridge xrpld's existing beast::insight StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.
Background
xrpld has a mature metrics framework (beast::insight) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that does not overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
Metric Inventory
| Category | Group | Type | Count | Key Metrics |
|---|---|---|---|---|
| Node State | State_Accounting | Gauge | 10 | *_duration, *_transitions per operating mode |
| Ledger | LedgerMaster | Gauge | 2 | Validated_Ledger_Age, Published_Ledger_Age |
| Ledger Fetch | — | Counter | 1 | ledger_fetches |
| Ledger History | ledger.history | Counter | 1 | mismatch |
| RPC | rpc | Counter+Event | 3 | requests, time (histogram), size (histogram) |
| Job Queue | — | Gauge+Event | 1 + 2×N | job_count, per-job {name} and {name}_q |
| Peer Finder | Peer_Finder | Gauge | 2 | Active_Inbound_Peers, Active_Outbound_Peers |
| Overlay | Overlay | Gauge | 1 | Peer_Disconnects |
| Overlay Traffic | per-category | Gauge | 4×57 = 228 | Bytes_In/Out, Messages_In/Out per traffic category |
| Pathfinding | — | Event | 2 | pathfind_fast, pathfind_full (histograms) |
| I/O | — | Event | 1 | ios_latency (histogram) |
| Resource Mgr | — | Meter | 2 | warn, drop (rate counters) |
| Caches | per-cache | Gauge | 2×N | {cache}.size, {cache}.hit_rate |
Total: ~255+ unique metrics (plus dynamic job-type and cache metrics)
Tasks
| Task | Description |
|---|---|
| 6.1 | DEFERRED Fix Meter wire format (|m → |c) in StatsDCollector.cpp — breaking change, tracked separately |
| 6.2 | Add statsd receiver to OTel Collector config |
| 6.3 | Expose UDP port 8125 in docker-compose.yml |
| 6.4 | Add [insight] config to integration test node configs |
| 6.5 | Create "Node Health" Grafana dashboard (16 panels) |
| 6.6 | Create "Network Traffic" Grafana dashboard (10 panels) |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus |
| 6.9 | Update TESTING.md and telemetry-runbook.md |
Wire Format Fix (Task 6.1) — DEFERRED
The StatsDMeterImpl in StatsDCollector.cpp:706 sends metrics with |m suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change |m to |c (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (warn, drop in Resource Manager).
Status: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom |m type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
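To make the wire format concrete: a StatsD datagram line has the shape `<name>:<value>|<type>`, where the type is `c` for counters, `g` for gauges, and `ms` for timers. The sketch below is not the code in StatsDCollector.cpp; the enum, the function names, and the metric names in the usage note are assumptions for illustration. It shows a meter being emitted as a counter, which is the essence of the deferred fix.

```cpp
#include <cassert>
#include <string>
#include <string_view>

enum class StatsDType { Counter, Gauge, Timer };

// Maps the instrument type to its StatsD type suffix.
inline std::string_view typeSuffix(StatsDType t) {
    switch (t) {
        case StatsDType::Counter: return "c";
        case StatsDType::Gauge:   return "g";
        case StatsDType::Timer:   return "ms";
    }
    return "c";  // not reached for valid enum values
}

// Formats one StatsD line: <name>:<value>|<type>
inline std::string statsdLine(std::string_view name, long long value, StatsDType t) {
    std::string line(name);
    line += ':';
    line += std::to_string(value);
    line += '|';
    line += typeSuffix(t);
    return line;
}

// A meter emitted as an increment-only counter (|c instead of |m).
inline std::string meterLine(std::string_view name, long long delta) {
    return statsdLine(name, delta, StatsDType::Counter);
}
```

For example, `meterLine("xrpld.warn", 1)` yields `xrpld.warn:1|c`, a line that standard StatsD consumers accept as a counter increment.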
New Grafana Dashboards
Node Health (statsd-node-health.json, uid: xrpld-statsd-node-health):
- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches, Key Jobs Execution/Dequeue Time, FullBelowCache Size/Hit Rate, Ledger Publish Gap, State Duration Rate, All Jobs Detail
Network Traffic (statsd-network-traffic.json, uid: xrpld-statsd-network):
- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories, Duplicate Traffic, All Traffic Categories Detail
RPC & Pathfinding (StatsD) (statsd-rpc-pathfinding.json, uid: xrpld-statsd-rpc):
- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap
Exit Criteria
- StatsD metrics visible in Prometheus (curl localhost:9090/api/v1/query?query=xrpld_LedgerMaster_Validated_Ledger_Age)
- All 3 new Grafana dashboards load without errors
- Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- Meter metrics (warn, drop) flow correctly after |m → |c fix: DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping)
6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12)
Objective: Replace StatsDCollector with a native OpenTelemetry Metrics SDK implementation behind the existing beast::insight::Collector interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline.
Motivation: Why Migrate from StatsD to Native OTel Metrics
The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves.
What We Gain
- Unified telemetry pipeline: Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."
- Eliminates StatsD UDP limitations: StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.
- Fixes the |m wire format issue: The StatsDMeterImpl uses a non-standard |m StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (the deferred Phase 6 Task 6.1 becomes resolved).
- Richer metric semantics: OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.
- Removes infrastructure dependency: No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML.
- Metric-to-trace correlation: OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it, which is impossible with StatsD-sourced metrics.
- Production-grade export: OTel's PeriodicMetricReader provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown, all built into the SDK rather than hand-rolled in StatsDCollectorImp.
What We Lose
- StatsD ecosystem compatibility: Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep server=statsd as a fallback.
- Simplicity of UDP: StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts.
- Slightly higher memory: OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.
- Dependency on OTel C++ Metrics SDK stability: The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem.
Decision
The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. StatsDCollector is retained as a fallback via server=statsd for operators who need StatsD ecosystem compatibility during the transition period.
Architecture
Class Hierarchy (after Phase 7)
beast::insight::Collector (abstract interface — unchanged)
|
+-- StatsDCollector (existing — retained as fallback, deprecated)
| +-- StatsDCounterImpl -> StatsD |c over UDP
| +-- StatsDGaugeImpl -> StatsD |g over UDP
| +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard)
| +-- StatsDEventImpl -> StatsD |ms over UDP
| +-- StatsDHookImpl -> 1s periodic callback
|
+-- NullCollector (existing — unchanged, used when disabled)
| +-- NullCounterImpl -> no-op
| +-- NullGaugeImpl -> no-op
| +-- NullMeterImpl -> no-op
| +-- NullEventImpl -> no-op
| +-- NullHookImpl -> no-op
|
+-- OTelCollector (NEW — Phase 7)
+-- OTelCounterImpl -> otel::Counter<int64_t>
+-- OTelGaugeImpl -> otel::ObservableGauge<uint64_t>
+-- OTelMeterImpl -> otel::Counter<uint64_t>
+-- OTelEventImpl -> otel::Histogram<double>
+-- OTelHookImpl -> 1s periodic callback (same pattern)
Data Flow (after Phase 7)
graph LR
subgraph xrpldNode["xrpld Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN"]
B["beast::insight<br/>OTelCollector"]
end
subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
BP["Batch Processor"]
SM["SpanMetrics Connector"]
R1 --> BP
BP --> SM
end
subgraph backends["Trace Backends"]
D["Tempo"]
end
subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
end
subgraph viz["Visualization"]
F["Grafana :3000"]
end
A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1
BP -->|"OTLP/gRPC"| D
SM -->|"RED metrics"| E
R1 -->|"xrpld_* metrics<br/>(native OTLP)"| E
E --> F
D --> F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#d9534f,color:#fff,stroke:#b52d2d
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
Key change: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.
Configuration
# [insight] section — new "otel" server option
[insight]
server=otel # NEW: uses OTel OTLP metrics exporter
prefix=xrpld # metric name prefix (preserved)
# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
The OTelCollector reads the OTLP endpoint from [telemetry] config (replacing /v1/traces with /v1/metrics for the metrics exporter). No additional config keys needed.
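The endpoint derivation described above amounts to a suffix swap. A minimal sketch, assuming the traces endpoint ends in `/v1/traces` and that an endpoint without that suffix is passed through unchanged (the function name and the pass-through behavior are assumptions, not confirmed by this plan):

```cpp
#include <cassert>
#include <string>

// Derives the OTLP/HTTP metrics endpoint from the traces endpoint by
// replacing a trailing "/v1/traces" with "/v1/metrics".
inline std::string metricsEndpointFromTraces(const std::string& tracesEndpoint) {
    const std::string from = "/v1/traces";
    const std::string to = "/v1/metrics";
    std::string out = tracesEndpoint;
    if (out.size() >= from.size() &&
        out.compare(out.size() - from.size(), from.size(), from) == 0) {
        out.replace(out.size() - from.size(), from.size(), to);
    }
    return out;  // unchanged when the suffix is absent (assumed behavior)
}
```

With the configuration shown above, `http://localhost:4318/v1/traces` becomes `http://localhost:4318/v1/metrics`. Matching only the trailing suffix avoids rewriting a `/v1/traces` that happens to appear earlier in a path.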
Backward compatibility: server=statsd continues to work exactly as before.
See Phase7_taskList.md for detailed per-task breakdown.
Instrument Type Mapping
| beast::insight | OTel Metrics SDK | Rationale |
|---|---|---|
| Counter (int64, \|c) | Counter<int64_t> | Direct 1:1 mapping |
| Gauge (uint64, \|g) | ObservableGauge<uint64_t> | Async callback matches existing Hook polling pattern |
| Meter (uint64, \|m) | Counter<uint64_t> | Fixes non-standard wire format; meters are semantically counters |
| Event (ms, \|ms) | Histogram<double> | Duration distributions with explicit bucket boundaries |
| Hook (1s callback) | PeriodicMetricReader alignment | Same 1s collection interval |
Tasks
| Task | Description |
|---|---|
| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) |
| 7.2 | Implement OTelCollector class (~400-500 lines) |
| 7.3 | Update CollectorManager — add server=otel |
| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) |
| 7.5 | Preserve metric names in Prometheus (naming strategy) |
| 7.6 | Update Grafana dashboards (if names change) |
| 7.7 | Update integration tests |
| 7.8 | Update documentation (runbook, reference docs) |
Exit Criteria
- All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- server=otel is the default in development docker-compose
- server=statsd still works as a fallback
- Existing Grafana dashboards display data correctly
- Integration test passes with OTLP-only metrics pipeline
- No performance regression vs StatsD baseline (< 1% CPU overhead)
- Deferred Task 6.1 (|m wire format) no longer relevant
6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13)
Motivation
xrpld's beast::Journal logs and OpenTelemetry traces are currently two disjoint observability signals. When investigating an issue, operators must manually correlate timestamps between log files and Tempo traces. Phase 8 bridges this gap by injecting trace context (trace_id, span_id) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.
Gains
- One-click trace-to-log navigation: Click a trace in Tempo and immediately see the corresponding log lines in Loki, filtered by trace_id.
- Reverse lookup (log-to-trace): Loki derived fields make trace_id values clickable links back to Tempo.
- Unified observability: All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance.
- Zero new dependencies in xrpld: Uses existing OTel SDK headers (GetSpan, GetContext) already linked in Phase 1.
- Negligible overhead: GetSpan() + GetContext() are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min.
Losses / Risks
- Log format change: Existing log parsers that rely on a fixed format will need updating to handle the optional trace_id=... span_id=... fields.
- Loki resource usage: Log ingestion adds storage and memory overhead to the observability stack (mitigated by retention policies).
- Filelog receiver complexity: The regex parser must be kept in sync with the log format; a format change in Logs::format() could break parsing.
Decision
The correlation value far outweighs the risks. The log format change is backward-compatible (fields are appended only when a span is active), and the filelog receiver regex is straightforward to maintain.
Architecture
Phase 8 has two independent sub-phases that can be developed in parallel:
- Phase 8a (code change): Modify Logs::format() in src/libxrpl/basics/Log.cpp to append trace_id=<hex32> span_id=<hex16> when the current thread has an active OTel span. Guarded by #ifdef XRPL_ENABLE_TELEMETRY.
- Phase 8b (infra only): Add Loki to the Docker Compose stack, configure the OTel Collector's filelog receiver to tail xrpld's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana Tempo↔Loki bidirectional linking.
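The Phase 8a behavior can be sketched as a helper that produces the `trace_id=<hex32> span_id=<hex16>` fragment when a span is active and nothing otherwise. This example deliberately avoids the OTel SDK: the `SpanContext` struct and `traceFields` function are hypothetical stand-ins, with `std::optional` modeling "the current thread has an active span".

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the active span's identifiers.
struct SpanContext {
    std::array<std::uint8_t, 16> traceId;
    std::array<std::uint8_t, 8> spanId;
};

// Lowercase hex encoding of a fixed-size byte array.
template <std::size_t N>
std::string toHex(const std::array<std::uint8_t, N>& bytes) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(2 * N);
    for (std::uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}

// Returns the trace fields for a log line, or an empty string when no
// span is active, so lines outside spans carry no empty placeholders.
inline std::string traceFields(const std::optional<SpanContext>& ctx) {
    if (!ctx) return {};
    return "trace_id=" + toHex(ctx->traceId) + " span_id=" + toHex(ctx->spanId);
}
```

Returning an empty string for the inactive case keeps the existing log format byte-for-byte identical outside spans, which is what makes the change backward-compatible for parsers.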
Trace ID Injection Flow
flowchart LR
subgraph xrpld["xrpld process"]
JLOG["JLOG(j.info())"]
Format["Logs::format()"]
OTelCtx["OTel Context<br/>(thread-local)"]
JLOG --> Format
OTelCtx -.->|"GetSpan()→GetContext()"| Format
end
subgraph output["Log Output"]
LogLine["2024-01-15T10:30:45.123Z<br/>LedgerMaster:NFO<br/>trace_id=abc123...<br/>span_id=def456...<br/>Validated ledger 42"]
end
Format --> LogLine
style xrpld fill:#1a237e,stroke:#0d1642,color:#fff
style output fill:#1b5e20,stroke:#0d3d14,color:#fff
style JLOG fill:#283593,stroke:#1a237e,color:#fff
style Format fill:#283593,stroke:#1a237e,color:#fff
style OTelCtx fill:#283593,stroke:#1a237e,color:#fff
style LogLine fill:#2e7d32,stroke:#1b5e20,color:#fff
Loki Ingestion Pipeline
flowchart LR
subgraph collector["OTel Collector"]
FR["filelog receiver<br/>tails debug.log"]
RP["regex_parser<br/>extracts trace_id,<br/>span_id, severity"]
BP["batch processor"]
LE["otlp/loki exporter"]
FR --> RP --> BP --> LE
end
LogFile["xrpld<br/>debug.log"] --> FR
LE --> Loki["Grafana Loki<br/>:3100"]
Loki <-->|"derivedFields ↔<br/>tracesToLogs"| Tempo["Grafana Tempo"]
style collector fill:#e65100,stroke:#bf360c,color:#fff
style FR fill:#f57c00,stroke:#e65100,color:#fff
style RP fill:#f57c00,stroke:#e65100,color:#fff
style BP fill:#f57c00,stroke:#e65100,color:#fff
style LE fill:#f57c00,stroke:#e65100,color:#fff
style LogFile fill:#1a237e,stroke:#0d1642,color:#fff
style Loki fill:#4a148c,stroke:#2e0d57,color:#fff
style Tempo fill:#4a148c,stroke:#2e0d57,color:#fff
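The extraction step performed by the regex parser can be prototyped in C++ to check a pattern before it goes into the collector config. The sketch below assumes the single-line layout suggested by the injection-flow diagram (`<timestamp> <partition>:<severity> [trace_id=... span_id=...] <message>`); the actual xrpld log layout and the collector's regex may differ, and `ParsedLog` and `parseLogLine` are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

struct ParsedLog {
    std::string timestamp, partition, severity, traceId, spanId, message;
};

// Splits one log line into fields. The trace fields are optional: lines
// emitted outside a span leave traceId and spanId empty.
inline std::optional<ParsedLog> parseLogLine(const std::string& line) {
    static const std::regex re(
        R"((\S+) (\w+):(\w+) (?:trace_id=([0-9a-f]{32}) span_id=([0-9a-f]{16}) )?(.*))");
    std::smatch m;
    if (!std::regex_match(line, m, re)) return std::nullopt;
    return ParsedLog{m[1].str(), m[2].str(), m[3].str(),
                     m[4].str(), m[5].str(), m[6].str()};
}
```

Making the trace group optional is the important design point: the same pattern must accept lines with and without trace context, otherwise every log line outside a span would fail to parse.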
Tasks
| Task | Description |
|---|---|
| 8.1 | Inject trace_id into Logs::format() |
| 8.2 | Add Loki to Docker Compose stack |
| 8.3 | Add filelog receiver to OTel Collector |
| 8.4 | Configure Grafana trace-to-log correlation |
| 8.5 | Update integration tests |
| 8.6 | Update documentation (runbook, reference docs) |
Parallel work: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential.
Exit Criteria
- Log lines within active spans contain trace_id=<hex> span_id=<hex>
- Log lines outside spans have no trace context (no empty fields)
- Loki ingests xrpld logs via OTel Collector filelog receiver
- Grafana Tempo → Loki one-click correlation works
- Grafana Loki → Tempo reverse lookup works via derived field
- Integration test verifies trace_id presence in logs
- No performance regression from trace_id injection (< 0.1% overhead)
6.8.2 Phase 9: Internal Metric Instrumentation Gap Fill (Weeks 14-15) — Future Enhancement
Status: Planned, not yet implemented.
Motivation
Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~68 metrics that exist inside xrpld's get_counts, server_info, TxQ, PerfLog, and CountedObject systems have no time-series export path. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most — NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
Architecture
Hybrid approach — two instrumentation strategies based on proximity to existing code:
flowchart TB
subgraph xrpld["xrpld process"]
subgraph existing["Existing beast::insight registrations"]
NS["NodeStore I/O<br/>(Database.cpp)"]
end
subgraph newreg["New OTel MetricsRegistry"]
CR["Cache Hit Rates<br/>(async gauge callbacks)"]
TQ["TxQ Metrics<br/>(async gauge callbacks)"]
PL["PerfLog RPC/Job<br/>(counters + histograms)"]
CO["CountedObjects<br/>(async gauge callbacks)"]
LF["Load Factors<br/>(async gauge callbacks)"]
end
end
subgraph export["Export Pipelines"]
BI["beast::insight<br/>OTelCollector (Phase 7)"]
OS["OTel Metrics SDK<br/>PeriodicMetricReader"]
end
NS --> BI
CR --> OS
TQ --> OS
PL --> OS
CO --> OS
LF --> OS
BI --> OTLP["OTLP/HTTP :4318<br/>/v1/metrics"]
OS --> OTLP
style xrpld fill:#1a2633,color:#ccc,stroke:#4a90d9
style existing fill:#2a4a6b,color:#fff,stroke:#4a90d9
style newreg fill:#2a4a6b,color:#fff,stroke:#4a90d9
style export fill:#1a3320,color:#ccc,stroke:#5cb85c
style NS fill:#4a90d9,color:#fff,stroke:#2a6db5
style CR fill:#5cb85c,color:#fff,stroke:#3d8b3d
style TQ fill:#5cb85c,color:#fff,stroke:#3d8b3d
style PL fill:#5cb85c,color:#fff,stroke:#3d8b3d
style CO fill:#5cb85c,color:#fff,stroke:#3d8b3d
style LF fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BI fill:#449d44,color:#fff,stroke:#2d6e2d
style OS fill:#449d44,color:#fff,stroke:#2d6e2d
style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
- beast::insight extensions (blue): NodeStore I/O metrics added near existing Database.cpp registrations, exported via Phase 7's OTelCollector.
- OTel MetricsRegistry (green): New centralized class using ObservableGauge async callbacks for cache, TxQ, PerfLog, CountedObjects, and load factors, polled at 10s intervals by PeriodicMetricReader.
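The callback-based registry idea can be sketched without the OTel SDK. In the example below, `MetricsRegistry::collect()` plays the role that the periodic reader would play, invoking each registered callback to sample the current value. The class shape, method names, `CacheStats`, and the metric name in the usage note are all assumptions for illustration; the planned class would register `ObservableGauge` callbacks with the SDK instead, and would need synchronization that this sketch omits.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical registry of gauge callbacks, sampled on demand.
class MetricsRegistry {
public:
    using Callback = std::function<double()>;

    // Returns false if a gauge with this name is already registered.
    bool registerGauge(std::string name, Callback cb) {
        return gauges_.emplace(std::move(name), std::move(cb)).second;
    }

    // Returns false if no gauge with this name exists.
    bool deregisterGauge(const std::string& name) {
        return gauges_.erase(name) > 0;
    }

    // Invokes every callback once and returns the sampled values.
    std::map<std::string, double> collect() const {
        std::map<std::string, double> snapshot;
        for (const auto& [name, cb] : gauges_) snapshot[name] = cb();
        return snapshot;
    }

private:
    std::map<std::string, Callback> gauges_;
};

// Hypothetical cache counters whose hit rate is exposed as a gauge.
struct CacheStats {
    unsigned long hits = 0;
    unsigned long misses = 0;
    double hitRate() const {
        const unsigned long total = hits + misses;
        return total ? static_cast<double>(hits) / total : 0.0;
    }
};
```

A cache would register something like `registry.registerGauge("cache.hit_rate", [&stats] { return stats.hitRate(); })`. The instrumented code pays nothing per operation beyond maintaining its own counters; the cost is incurred only when the collector samples.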
Third-Party Consumer Context
| Consumer Category | Key Metrics They Need From Phase 9 |
|---|---|
| Exchanges | Fee escalation levels, TxQ depth, settlement latency |
| Payment Processors | Load factors, io_latency, transaction throughput |
| Analytics Providers | NodeStore I/O, cache hit rates, counted objects |
| Validators / Operators | Per-job execution times, PerfLog RPC counters, consensus timing |
| Academic Researchers | Consensus performance time-series, fee market dynamics |
| Institutional Custody | Server health scores, reserve calculations, node availability |
Tasks
| Task | Description |
|---|---|
| 9.1 | NodeStore I/O metrics |
| 9.2 | Cache hit rate metrics + MetricsRegistry |
| 9.3 | TxQ metrics |
| 9.4 | PerfLog per-RPC metrics |
| 9.5 | PerfLog per-job metrics |
| 9.6 | Counted object instance metrics |
| 9.7 | Fee escalation & load factor metrics |
| 9.7a | push_metrics.py parity gauges |
| 9.8 | New Grafana dashboards (2 new, 2 updated) |
| 9.9 | Update documentation |
| 9.10 | Integration tests |
See Phase9_taskList.md for detailed per-task breakdown.
Exit Criteria
- All ~68 new metrics visible in Prometheus via OTLP pipeline
- MetricsRegistry class registers/deregisters cleanly with OTel SDK
- 2 new Grafana dashboards operational (Fee Market, Job Queue)
- No performance regression (< 0.5% CPU overhead from new callbacks)
- Documentation updated with full new metric inventory
6.8.3 Phase 10: Synthetic Workload Generation & Telemetry Validation (Weeks 16-17)
Status: In progress.
Motivation
Before the telemetry stack (Phases 1-9) can be considered production-ready, we need automated proof that all spans, attributes, metrics, Grafana dashboards, and log-trace correlation work correctly under realistic load. This phase establishes a reusable CI-integrated validation suite and performance benchmark baseline.
Architecture
The validation uses a 2-node validator cluster running as local processes alongside a Docker Compose telemetry stack (Collector, Tempo, Prometheus, Grafana). Two nodes are sufficient for consensus rounds and peer-to-peer span validation while minimizing CI resource usage.
flowchart LR
subgraph harness["2-Node Validator Cluster (local processes)"]
direction TB
V1["Validator 1"] ~~~ V2["Validator 2"]
end
subgraph telemetry["Docker Compose Telemetry Stack"]
direction TB
COL["OTel Collector<br/>(OTLP + StatsD)"]
JAE["Tempo<br/>(trace search)"]
PROM["Prometheus<br/>(metrics)"]
GRAF["Grafana<br/>(dashboards)"]
end
subgraph generators["Workload Generators"]
RPC["RPC Load Generator<br/>(configurable RPS,<br/>command distribution)"]
TX["Transaction Submitter<br/>(10 tx types via<br/>WebSocket command API)"]
end
subgraph validation["Validation Suite"]
SV["Span Validator<br/>(Tempo API)"]
MV["Metric Validator<br/>(Prometheus API,<br/>all 26 metrics required)"]
DV["Dashboard Validator<br/>(Grafana API)"]
BM["Benchmark Suite<br/>(CPU, memory, latency<br/>ON vs OFF comparison)"]
end
generators --> harness
harness --> telemetry
telemetry --> validation
style harness fill:#1a2633,color:#ccc,stroke:#4a90d9
style telemetry fill:#1a2633,color:#ccc,stroke:#4a90d9
style generators fill:#1a3320,color:#ccc,stroke:#5cb85c
style validation fill:#332a1a,color:#ccc,stroke:#f0ad4e
style V1 fill:#4a90d9,color:#fff,stroke:#2a6db5
style V2 fill:#4a90d9,color:#fff,stroke:#2a6db5
style COL fill:#4a90d9,color:#fff,stroke:#2a6db5
style JAE fill:#4a90d9,color:#fff,stroke:#2a6db5
style PROM fill:#4a90d9,color:#fff,stroke:#2a6db5
style GRAF fill:#4a90d9,color:#fff,stroke:#2a6db5
style RPC fill:#5cb85c,color:#fff,stroke:#3d8b3d
style TX fill:#5cb85c,color:#fff,stroke:#3d8b3d
style SV fill:#f0ad4e,color:#000,stroke:#c78c2e
style MV fill:#f0ad4e,color:#000,stroke:#c78c2e
style DV fill:#f0ad4e,color:#000,stroke:#c78c2e
style BM fill:#f0ad4e,color:#000,stroke:#c78c2e
Key Implementation Details
- Transaction submitter and RPC load generator both use xrpld's native WebSocket command format (`{"command": ...}`), not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
- Node config requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics.
- Metric validation uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series.
- StatsD gauge fix: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus.
- I/O latency fix: `io_latency_sampler` emits unconditionally on first sample, then applies the 10 ms threshold. This ensures `ios_latency` is registered in Prometheus even in low-load CI environments.
- tx.receive span: Sets default attributes (`xrpl.tx.suppressed = false`, `xrpl.tx.status = "new"`) on span creation so they are always present. The suppressed/bad code paths override these when applicable.
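The two "emit on first observation" fixes above share one idea, which the following Python model tries to illustrate. It is a sketch of the logic only: the real code is C++, `GaugeModel` and `LatencySamplerModel` are hypothetical names, and the `>=` comparison against the 10 ms threshold is an assumption.

```python
from typing import Optional


class GaugeModel:
    """Models a StatsD gauge that only emits when marked dirty."""

    def __init__(self, name: str, dirty_at_start: bool = True) -> None:
        self.name = name
        self.value = 0
        # Starting dirty is the fix: the initial value is flushed once even
        # if the gauge never changes afterwards.
        self.dirty = dirty_at_start

    def set(self, value: int) -> None:
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self) -> Optional[str]:
        """Return a StatsD gauge line if there is something to send."""
        if not self.dirty:
            return None
        self.dirty = False
        return f"{self.name}:{self.value}|g"


class LatencySamplerModel:
    """Models a sampler that always reports its first sample."""

    def __init__(self, threshold_ms: float = 10.0) -> None:
        self.threshold_ms = threshold_ms
        self.first_sample = True

    def should_emit(self, latency_ms: float) -> bool:
        if self.first_sample:
            # The first sample is emitted unconditionally so the metric
            # exists downstream even on an idle node.
            self.first_sample = False
            return True
        # Assumed comparison; the source only states "10 ms threshold".
        return latency_ms >= self.threshold_ms
```

With `dirty_at_start=False` the model reproduces the original problem: a gauge that stays at 0 never produces a line, so the series never appears in Prometheus.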
Tasks
| Task | Description |
|---|---|
| 10.1 | Multi-node test harness (2 validators) |
| 10.2 | RPC load generator |
| 10.3 | Transaction submitter (6+ tx types) |
| 10.4 | Telemetry validation suite |
| 10.5 | Performance benchmark suite |
| 10.6 | CI integration |
| 10.7 | Documentation |
See Phase10_taskList.md for detailed per-task breakdown.
Validation Check Inventory (71 Checks)
The validation suite (validate_telemetry.py) runs exactly 71 checks, broken down as:
- 1 service registration: `xrpld` exists in Tempo
- 17 span existence: `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive`
- 14 span attribute: required attributes on the 14 spans that define them (22 unique attributes total)
- 2 span hierarchies: `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread)
- 1 span duration bounds: all spans > 0 and < 60 s
- 26 metric existence: 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`)
- 10 dashboard loads: `xrpld-rpc-perf`, `xrpld-transactions`, `xrpld-consensus`, `xrpld-ledger-ops`, `xrpld-peer-net`, `xrpld-system-node-health`, `xrpld-system-network`, `xrpld-system-rpc`, `xrpld-system-overlay-detail`, `xrpld-system-ledger-sync`
See Phase10_taskList.md for the full numbered check-by-check enumeration.
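As a rough illustration of how the inventory adds up and how a single metric-existence check might be expressed, consider the Python sketch below. It is not taken from `validate_telemetry.py`; the helper names `series_url` and `has_series` are hypothetical, and the base URL is left to the caller.

```python
import json
from urllib.parse import urlencode

# Category counts copied from the inventory above.
CHECK_COUNTS = {
    "service registration": 1,
    "span existence": 17,
    "span attribute": 14,
    "span hierarchy": 2,
    "span duration bounds": 1,
    "metric existence": 26,
    "dashboard load": 10,
}


def series_url(base_url: str, metric: str) -> str:
    """Build a Prometheus series query for one metric name."""
    # The series endpoint selects by repeated match[] parameters.
    return f"{base_url}/api/v1/series?{urlencode({'match[]': metric})}"


def has_series(body: str) -> bool:
    """A metric passes when the response lists at least one series."""
    payload = json.loads(body)
    return payload.get("status") == "success" and len(payload.get("data", [])) > 0
```

Querying for series rather than for a current sample means a gauge that was written once and then went stale still counts as present, which is the behaviour the validation relies on.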
Current Status
Working (71/71 checks pass in CI): All 17 spans, 26 metrics, 10 dashboards, 14 attribute checks, 2 hierarchies, and duration bounds validated.
Not implemented or not available in CI:
- `rpc.request` -> `rpc.process` parent-child hierarchy: skipped (cross-thread context propagation)
- Log-trace correlation validation (Loki): not included in checks
- Full 255+ StatsD metric coverage: only 26 representative metrics validated
- Sustained load / backpressure testing: not implemented
- `docs/telemetry-runbook.md` updates: not done
- `09-data-collection-reference.md` "Validation" section: not done
- Automated cross-CI baseline persistence: the regression gate reads a committed baseline; baseline updates flow through a manual PR refresh, not an artifact promoted from `develop` (FU-2).
Exit Criteria
- 2-node validator cluster starts and reaches consensus
- Validation suite confirms all required spans, attributes, and metrics (71/71 checks)
- All 10 Grafana dashboards render data
- Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- CI workflow runs validation on telemetry branch changes
- OTel-driven regression gate: captures per-span/per-RPC/per-job timings from Prometheus and compares against a committed baseline
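The regression gate in the last criterion amounts to comparing observed timings against a committed baseline. A minimal Python sketch follows, assuming timings are keyed by span, RPC, or job name and expressed in milliseconds; the function name and the 10% tolerance are assumptions, not values from the plan.

```python
from typing import Dict, List


def regression_failures(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,  # assumed threshold, for illustration only
) -> List[str]:
    """Return one message per timing that regressed or disappeared."""
    failures: List[str] = []
    for name, base in sorted(baseline.items()):
        observed = current.get(name)
        if observed is None:
            # A timing that vanished usually means instrumentation broke.
            failures.append(f"{name}: missing from current run")
        elif observed > base * (1 + tolerance):
            failures.append(
                f"{name}: {observed:.1f} ms exceeds baseline "
                f"{base:.1f} ms by more than {tolerance:.0%}"
            )
    return failures
```

A CI job would fail the build when the returned list is non-empty. Treating a missing timing as a failure is a design choice: it catches silently dropped spans as well as slowdowns.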
6.8.4 Phase 11: Third-Party Data Collection Pipelines (Weeks 18-20) — Future Enhancement
Status: Planned, not yet implemented.
Motivation
xrpld has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (server_info, get_counts, fee, peers, validators, feature). Every external consumer — exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms — must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver.
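The polling-and-conversion work that each consumer currently duplicates looks roughly like the Python sketch below. The planned receiver is written in Go; this is only an illustration of the data flow. The `xrpl_*` metric names are hypothetical, and the `load_factor` and `peers` fields are assumed to be present in the `server_info` response.

```python
import json
from typing import Dict


def server_info_request() -> str:
    """JSON-RPC body for server_info (method/params envelope)."""
    return json.dumps({"method": "server_info", "params": [{}]})


def to_metrics(response: dict) -> Dict[str, float]:
    """Map selected server_info fields to gauge values.

    Field names are assumptions; absent fields are skipped rather than
    reported as zero, so a missing value is never mistaken for a reading.
    """
    info = response["result"]["info"]
    mapping = {
        "load_factor": "xrpl_load_factor",  # hypothetical metric name
        "peers": "xrpl_peers",              # hypothetical metric name
    }
    return {
        metric: float(info[field])
        for field, metric in mapping.items()
        if field in info
    }


def render(metrics: Dict[str, float]) -> str:
    """Render gauges as Prometheus text exposition lines."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))
```

Centralizing this mapping in one receiver means the field-to-metric table is maintained once, instead of once per consumer.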
Architecture
flowchart LR
subgraph receiver["Custom OTel Collector Receiver (Go)"]
direction TB
SI["server_info<br/>collector"]
GC["get_counts<br/>collector"]
FE["fee<br/>collector"]
PE["peers<br/>collector"]
VA["validators<br/>collector"]
DX["DEX/AMM<br/>collector<br/>(optional)"]
end
xrpld["xrpld<br/>Admin RPC<br/>:5005"] -->|"JSON-RPC<br/>poll every 30s"| receiver
receiver -->|"xrpl_* metrics"| PROM["Prometheus<br/>:9090"]
receiver -->|"OTLP export"| OTLP["Any OTLP-<br/>compatible<br/>backend"]
PROM --> GF["Grafana<br/>4 new dashboards"]
PROM --> AL["Prometheus<br/>Alerting Rules"]
style receiver fill:#1a3320,color:#ccc,stroke:#5cb85c
style SI fill:#5cb85c,color:#fff,stroke:#3d8b3d
style GC fill:#5cb85c,color:#fff,stroke:#3d8b3d
style FE fill:#5cb85c,color:#fff,stroke:#3d8b3d
style PE fill:#5cb85c,color:#fff,stroke:#3d8b3d
style VA fill:#5cb85c,color:#fff,stroke:#3d8b3d
style DX fill:#449d44,color:#fff,stroke:#2d6e2d
style xrpld fill:#4a90d9,color:#fff,stroke:#2a6db5
style PROM fill:#f0ad4e,color:#000,stroke:#c78c2e
style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
style GF fill:#5bc0de,color:#000,stroke:#3aa8c1
style AL fill:#d9534f,color:#fff,stroke:#b52d2d
Third-Party Consumer Gap Analysis
| Consumer Category | Data Unlocked by Phase 11 |
|---|---|
| Exchanges | Real-time fee estimates, TxQ capacity, server health scores |
| Payment Processors | Settlement latency percentiles, corridor health |
| Analytics Providers | Validator metrics, network topology, amendment voting status |
| DeFi / AMM | AMM pool TVL, DEX order book depth, trade volumes |
| Validators / Operators | Per-peer latency, version distribution, UNL health, alerting |
| Compliance | Transaction volume trends, network growth metrics |
| Academic Researchers | Consensus performance time-series, decentralization metrics |
| CBDC / Tokenization | Token supply tracking, trust line adoption, freeze status |
| Institutional Custody | Multi-sig status, escrow tracking, reserve calculations |
| Wallet Providers | Server health for node selection, fee prediction data |
Tasks
| Task | Description |
|---|---|
| 11.1 | OTel Collector receiver scaffold (Go) |
| 11.2 | server_info / server_state collector |
| 11.3 | get_counts collector |
| 11.4 | Peer topology collector |
| 11.5 | Validator & amendment collector |
| 11.6 | Fee & TxQ collector |
| 11.7 | DEX & AMM collector (optional) |
| 11.8 | Prometheus alerting rules |
| 11.9 | New Grafana dashboards (4) |
| 11.10 | Integration with Phase 10 validation |
| 11.11 | Documentation |
See Phase11_taskList.md for detailed per-task breakdown.
Exit Criteria
- Custom OTel Collector receiver exports all `xrpl_*` metrics to Prometheus
- 4 new Grafana dashboards operational (Validator Health, Network Topology, Fee Market, DEX/AMM)
- Prometheus alerting rules fire correctly for simulated failures
- Receiver handles xrpld restart/unavailability gracefully
- Go receiver has unit tests with >80% coverage
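One way to read "handles xrpld restart/unavailability gracefully" is that a failed poll backs off and retries instead of crashing the receiver. The Python sketch below illustrates that behaviour under assumed parameters; the function name, attempt count, and delays are all hypothetical, and the real receiver would be written in Go.

```python
import time
from typing import Callable, Optional


def poll_with_backoff(
    fetch: Callable[[], dict],
    max_attempts: int = 5,      # assumed value
    base_delay: float = 1.0,    # assumed value, seconds
    max_delay: float = 30.0,    # assumed value, seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch until it succeeds, doubling the wait after each failure.

    Returns None when every attempt fails, so the caller can skip this
    scrape cycle and try again on the next one.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            # Connection refused/reset while the node restarts lands here.
            if attempt == max_attempts - 1:
                return None
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return None
```

Injecting `sleep` keeps the function testable without real waiting, which matters for the unit-test coverage target in the criteria above.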
6.9 Risk Assessment
quadrantChart
title Risk Assessment Matrix
x-axis Low Impact --> High Impact
y-axis Low Likelihood --> High Likelihood
quadrant-1 Mitigate Immediately
quadrant-2 Plan Mitigation
quadrant-3 Accept Risk
quadrant-4 Monitor Closely
SDK Compat: [0.2, 0.18]
Protocol Chg: [0.75, 0.72]
Perf Overhead: [0.58, 0.42]
Context Prop: [0.4, 0.55]
Memory Leaks: [0.85, 0.25]
Risk Details
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Protocol changes break compatibility | Medium | High | Use high field numbers, optional fields |
| Performance overhead unacceptable | Medium | Medium | Sampling, conditional compilation |
| Context propagation complexity | Medium | Medium | Phased rollout, extensive testing |
| SDK compatibility issues | Low | Medium | Pin SDK version, fallback to no-op |
| Memory leaks in long-running nodes | Low | High | Memory profiling, bounded queues |
6.10 Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Trace coverage | >95% of transaction code paths (independent of sampling ratio) | Sampling verification |
| CPU overhead | <3% | Benchmark tests |
| Memory overhead | <10 MB | Memory profiling |
| Latency impact (p99) | <2% | Performance tests |
| Trace completeness | >99% spans with required attrs | Validation script |
| Cross-node trace linkage | >90% of multi-hop transactions | Integration tests |
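The trace completeness target can be computed as the share of checked spans that carry every required attribute. The Python sketch below assumes spans are available as dictionaries with `name` and `attributes` keys, which is a hypothetical shape rather than the Tempo API format; the `tx.receive` attributes are the defaults named in the Phase 10 implementation details.

```python
from typing import Dict, List, Set


def completeness(spans: List[dict], required: Dict[str, Set[str]]) -> float:
    """Fraction of spans that carry all attributes required for their name.

    Spans whose name has no requirement are ignored. With nothing to
    check, the result is 1.0 rather than a division by zero.
    """
    checked = [s for s in spans if s["name"] in required]
    if not checked:
        return 1.0
    complete = sum(
        1 for s in checked if required[s["name"]] <= set(s["attributes"])
    )
    return complete / len(checked)


REQUIRED = {"tx.receive": {"xrpl.tx.suppressed", "xrpl.tx.status"}}
```

A validation script would compare the returned ratio against the 0.99 target in the table.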
6.9 Quick Wins and Crawl-Walk-Run Strategy
TxQ = Transaction Queue
This section outlines a prioritized approach to maximize ROI with minimal initial investment.
6.9.1 Crawl-Walk-Run Overview
flowchart TB
subgraph crawl["🐢 CRAWL (Week 1-2)"]
direction LR
c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[PathFinding + TxQ Tracing] ~~~ c4[Single Node]
end
subgraph walk["🚶 WALK (Week 3-5)"]
direction LR
w1[Transaction Tracing] ~~~ w2[Fee Escalation Tracing] ~~~ w3[Cross-Node Context] ~~~ w4[Basic Dashboards]
end
subgraph run["🏃 RUN (Week 6-9)"]
direction LR
r1[Consensus Tracing] ~~~ r2[Establish Phase<br/>& Cross-Node Correlation] ~~~ r3[StatsD Integration] ~~~ r4[Production Deploy]
end
crawl --> walk --> run
style crawl fill:#1b5e20,stroke:#0d3d14,color:#fff
style walk fill:#bf360c,stroke:#8c2809,color:#fff
style run fill:#0d47a1,stroke:#082f6a,color:#fff
style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff
style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff
style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff
style c4 fill:#1b5e20,stroke:#0d3d14,color:#fff
style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style w4 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style r1 fill:#0d47a1,stroke:#082f6a,color:#fff
style r2 fill:#0d47a1,stroke:#082f6a,color:#fff
style r3 fill:#0d47a1,stroke:#082f6a,color:#fff
style r4 fill:#0d47a1,stroke:#082f6a,color:#fff
Reading the diagram:
- CRAWL (Weeks 1-2): Minimal investment -- set up the SDK, instrument RPC and PathFinding/TxQ handlers, and verify on a single node. Delivers immediate latency visibility.
- WALK (Weeks 3-5): Expand to transaction lifecycle tracing, fee escalation, cross-node context propagation, and basic Grafana dashboards. This is where distributed tracing starts working.
- RUN (Weeks 6-9): Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
- Arrows (crawl → walk → run): Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
6.9.2 Quick Wins (Immediate Value)
| Quick Win | Value | When to Deploy |
|---|---|---|
| RPC Command Tracing | High | Week 2 |
| RPC Latency Histograms | High | Week 2 |
| Error Rate Dashboard | Medium | Week 2 |
| Transaction Submit Tracing | High | Week 3 |
| Consensus Round Duration | Medium | Week 6 |
6.9.3 CRAWL Phase (Weeks 1-2)
Goal: Get basic tracing working with minimal code changes.
What You Get:
- RPC request/response traces for all commands
- Latency breakdown per RPC command
- PathFinding and TxQ tracing (directly impacts RPC latency)
- Error visibility with stack traces
- Basic Grafana dashboard
Code Changes: ~15 lines in ServerHandler.cpp, ~40 lines in new telemetry module
Why Start Here:
- RPC is the lowest-risk, highest-visibility component
- PathFinding and TxQ are RPC-adjacent and directly affect latency
- Immediate value for debugging client issues
- No cross-node complexity
- Single file modification to existing code
6.9.4 WALK Phase (Weeks 3-5)
Goal: Add transaction lifecycle tracing across nodes.
What You Get:
- End-to-end transaction traces from submit to relay
- Fee escalation tracing within the transaction pipeline
- Cross-node correlation (see transaction path)
- HashRouter deduplication visibility
- Relay latency metrics
Code Changes: ~120 lines across 4 files, plus protobuf extension
Why Do This Second:
- Builds on RPC tracing (transactions submitted via RPC)
- Fee escalation is integral to the transaction processing pipeline
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues
6.9.5 RUN Phase (Weeks 6-9)
Goal: Full observability including consensus.
What You Get:
- Complete consensus round visibility
- Phase transition timing
- Validator proposal tracking
- Validator list and manifest tracing: descoped
- Amendment voting tracing: descoped
- SHAMap sync tracing: descoped
- Full end-to-end traces (client → RPC → TX → consensus → ledger): partial (tx-consensus correlation not yet done)
Code Changes: ~100 lines across 3 consensus files
Why Do This Last:
- Highest complexity (consensus is critical path)
- Validator, amendment, and SHAMap components were descoped (lower priority)
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
6.9.6 ROI Prioritization Matrix
quadrantChart
title Implementation ROI Matrix
x-axis Low Effort --> High Effort
y-axis Low Value --> High Value
quadrant-1 Quick Wins - Do First
quadrant-2 Major Projects - Plan Carefully
quadrant-3 Nice to Have - Optional
quadrant-4 Time Sinks - Avoid
RPC Tracing: [0.15, 0.92]
TX Submit Trace: [0.3, 0.78]
TX Relay Trace: [0.5, 0.88]
Consensus Trace: [0.72, 0.72]
Peer Msg Trace: [0.85, 0.3]
Ledger Acquire: [0.55, 0.52]
6.13 Definition of Done
TxQ = Transaction Queue | HA = High Availability
Clear, measurable criteria for each phase.
6.13.1 Phase 1: Core Infrastructure
| Criterion | Measurement | Target |
|---|---|---|
| SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles |
| Runtime Toggle | `enabled=0` produces zero overhead | <0.1% CPU difference |
| Span Creation | Unit test creates and exports span | Span appears in Tempo |
| Configuration | All config options parsed correctly | Config validation tests pass |
| Documentation | Developer guide exists | PR approved |
Definition of Done: All criteria met, PR merged, no regressions in CI.
6.13.2 Phase 2: RPC Tracing
| Criterion | Measurement | Target |
|---|---|---|
| Coverage | All RPC commands instrumented | 100% of commands |
| Context Extraction | traceparent header propagates | Integration test passes |
| Attributes | Command, status, duration recorded | Validation script confirms |
| Performance | RPC latency overhead | <1ms p99 |
| Dashboard | Grafana dashboard deployed | Screenshot in docs |
Definition of Done: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
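For the context-extraction criterion, the incoming `traceparent` header follows the W3C Trace Context layout `version-traceid-parentid-flags`. The Python sketch below shows the parsing and validity rules an integration test might assert against; it is not the xrpld implementation, and `parse_traceparent` is a hypothetical name.

```python
import re
from typing import Optional

# Four lowercase-hex fields: 2-char version, 32-char trace id,
# 16-char parent span id, 2-char flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the extracted context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden, and all-zero ids are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

Returning `None` for a malformed header lets the handler fall back to starting a fresh root span, so a bad client header never fails the request.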
6.13.3 Phase 3: Transaction Tracing
| Criterion | Measurement | Target |
|---|---|---|
| Local Trace | Submit → validate → TxQ traced | Single-node test passes |
| Cross-Node | Context propagates via protobuf | Multi-node test passes |
| Deterministic TraceID | Same trace_id on all nodes for same tx | Multi-node test: query by txHash[0:16] returns all spans |
| Relay Ordering | Protobuf span_id propagation creates parent-child | Tempo trace tree shows relay chain |
| Graceful Degradation | Old peer drops trace_context | Spans still grouped by deterministic trace_id |
| Relay Visibility | relay_count attribute correct | Spot check 100 txs |
| HashRouter | Deduplication visible in trace | Duplicate txs show suppressed=true |
| Performance | TX throughput overhead | <5% degradation |
Definition of Done: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.
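The deterministic trace ID can be sketched as a pure function of the transaction hash. The Python example below assumes `txHash[0:16]` means the first 16 bytes of a 32-byte hash, which matches the 16-byte trace ID size used by OpenTelemetry; the function name is hypothetical.

```python
def trace_id_from_tx_hash(tx_hash_hex: str) -> str:
    """Derive a 32-hex-char trace id from a 64-hex-char transaction hash.

    Every node computes the same value for the same transaction, so spans
    group under one trace even when no context was propagated.
    """
    raw = bytes.fromhex(tx_hash_hex)
    if len(raw) != 32:
        raise ValueError("expected a 32-byte transaction hash")
    trace_id = raw[:16]
    if trace_id == bytes(16):
        # An all-zero trace id is invalid, so refuse to produce one.
        raise ValueError("derived trace id is all zeros")
    return trace_id.hex()
```

This is what makes the graceful-degradation row work: a peer that drops `trace_context` still derives the same trace ID locally, and only the parent-child ordering is lost.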
6.13.4 Phase 4: Consensus Tracing
| Criterion | Measurement | Target |
|---|---|---|
| Round Tracing | startRound creates root span | Unit test passes |
| Phase Visibility | All phases have child spans | Integration test confirms |
| Proposer Attribution | Proposer ID in attributes | Spot check 50 rounds |
| Timing Accuracy | Phase durations match PerfLog | <5% variance |
| No Consensus Impact | Round timing unchanged | Performance test passes |
Definition of Done: Consensus rounds fully traceable, no impact on consensus timing.
6.13.5 Phase 5: Production Deployment
| Criterion | Measurement | Target |
|---|---|---|
| Collector HA | Multiple collectors deployed | No single point of failure |
| Sampling | Tail sampling configured | 10% base + errors + slow |
| Retention | Data retained per policy | 7 days hot, 30 days warm |
| Alerting | Alerts configured | Error spike, high latency |
| Runbook | Operator documentation | Approved by ops team |
| Training | Team trained | Session completed |
Definition of Done: Telemetry running in production, operators trained, alerts active.
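The sampling target "10% base + errors + slow" describes a tail-sampling decision made after a trace completes. In practice this would be expressed as collector configuration; the Python model below only illustrates the decision logic, and the function name and the 1000 ms slow threshold are assumptions.

```python
def keep_trace(
    trace_id_hex: str,
    has_error: bool,
    duration_ms: float,
    base_ratio: float = 0.10,
    slow_ms: float = 1000.0,  # assumed threshold, not from the plan
) -> bool:
    """Decide whether a finished trace is retained."""
    # Errors and slow traces are always kept, regardless of the base rate.
    if has_error or duration_ms >= slow_ms:
        return True
    # Map the low 64 bits of the trace id onto [0, 1] and keep the bottom
    # slice, so every collector makes the same decision for a given trace.
    bucket = int(trace_id_hex[-16:], 16) / 2**64
    return bucket < base_ratio
```

Deriving the base decision from the trace ID rather than a random draw keeps it consistent across collectors, which matters once more than one collector is deployed for HA.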
6.13.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline | Status |
|---|---|---|---|---|
| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | Active |
| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | Active |
| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | Active |
| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | Active |
| Phase 5 | Production deployment | Operators trained | End of Week 9 | Active |
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | Active |
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | Active |
| Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 | Active |
| Phase 9 | 68+ new internal metrics in Prometheus | 2 new dashboards | End of Week 15 | Future Enhancement |
| Phase 10 | Full telemetry stack validated; OTel-sourced regression gate in CI | < 3% CPU overhead proven | End of Week 17 | In Progress |
| Phase 11 | Third-party metrics via receiver | 4 new dashboards + alerting | End of Week 20 | Future Enhancement |
6.13 Recommended Implementation Order
Based on ROI analysis, implement in this exact order:
flowchart TB
subgraph week1["Week 1"]
t1[1. OpenTelemetry SDK<br/>Conan/CMake integration]
t2[2. Telemetry interface<br/>SpanGuard, config]
end
subgraph week2["Week 2"]
t3[3. RPC ServerHandler<br/>instrumentation]
t4[4. Basic Tempo setup<br/>for testing]
end
subgraph week3["Week 3"]
t5[5. Transaction submit<br/>tracing]
t6[6. Grafana dashboard<br/>v1]
end
subgraph week4["Week 4"]
t7[7. Protobuf context<br/>extension]
t8[8. PeerImp tx.relay<br/>instrumentation]
end
subgraph week5["Week 5"]
t9[9. Multi-node<br/>integration tests]
t10[10. Performance<br/>benchmarks]
end
subgraph week6_8["Weeks 6-8"]
t11[11. Consensus<br/>instrumentation]
t12[12. Full integration<br/>testing]
end
subgraph week9["Week 9"]
t13[13. Production<br/>deployment]
t14[14. Documentation<br/>& training]
end
t1 --> t2 --> t3 --> t4
t4 --> t5 --> t6
t6 --> t7 --> t8
t8 --> t9 --> t10
t10 --> t11 --> t12
t12 --> t13 --> t14
style week1 fill:#1b5e20,stroke:#0d3d14,color:#fff
style week2 fill:#1b5e20,stroke:#0d3d14,color:#fff
style week3 fill:#bf360c,stroke:#8c2809,color:#fff
style week4 fill:#bf360c,stroke:#8c2809,color:#fff
style week5 fill:#bf360c,stroke:#8c2809,color:#fff
style week6_8 fill:#0d47a1,stroke:#082f6a,color:#fff
style week9 fill:#4a148c,stroke:#2e0d57,color:#fff
style t1 fill:#1b5e20,stroke:#0d3d14,color:#fff
style t2 fill:#1b5e20,stroke:#0d3d14,color:#fff
style t3 fill:#1b5e20,stroke:#0d3d14,color:#fff
style t4 fill:#1b5e20,stroke:#0d3d14,color:#fff
style t5 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t6 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t7 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t8 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t9 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t10 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
style t11 fill:#0d47a1,stroke:#082f6a,color:#fff
style t12 fill:#0d47a1,stroke:#082f6a,color:#fff
style t13 fill:#4a148c,stroke:#2e0d57,color:#fff
style t14 fill:#4a148c,stroke:#2e0d57,color:#fff
Reading the diagram:
- Week 1 (tasks 1-2): Foundation work -- integrate the OpenTelemetry SDK via Conan/CMake and build the `Telemetry` interface with `SpanGuard` and config parsing.
- Week 2 (tasks 3-4): First observable output -- instrument `ServerHandler` for RPC tracing and stand up Tempo so developers can see traces immediately.
- Weeks 3-5 (tasks 5-10): Transaction lifecycle -- add submit tracing, build the first Grafana dashboard, extend protobuf for cross-node context, instrument `PeerImp` relay, then validate with multi-node integration tests and performance benchmarks.
- Weeks 6-8 (tasks 11-12): Consensus deep-dive -- instrument consensus rounds and phases, then run full integration testing across all instrumented paths.
- Week 9 (tasks 13-14): Go-live -- deploy to production with sampling/alerting configured, and deliver documentation and operator training.
- Arrow chain (t1 → ... → t14): Strict sequential dependency; each task's output is a prerequisite for the next.
Previous: Configuration Reference | Next: Observability Backends | Back to: Overview