Compare commits

145 Commits

Author SHA1 Message Date
Pratik Mankawde
289b049b70 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-03 17:25:22 +01:00
Pratik Mankawde
4e422a0354 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 17:25:22 +01:00
Pratik Mankawde
ac1805f0a4 feat(telemetry): add spanmetrics dimensions and dashboard panels for enriched attrs
Collector config: add tx_type, ter_result, txq_status, consensus_state,
load_type, is_batch as spanmetrics dimensions so they appear as
Prometheus labels for dashboard queries.

New dashboard panels:
- Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie),
  Transactor Duration p95 by Type
- Consensus Health: Outcome Distribution (pie), Failures Over Time
- RPC Performance: Resource Cost by Command, Batch vs Single

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:51:51 +01:00
Pratik Mankawde
365907ab22 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 16:40:22 +01:00
Pratik Mankawde
03fffec640 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-03 16:40:22 +01:00
Pratik Mankawde
a4bc7bd611 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 16:32:31 +01:00
Pratik Mankawde
c5bdaafc39 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-03 16:32:31 +01:00
Pratik Mankawde
ac79a5123e Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Resolve runbook conflict: keep both phase 6 ledger/peer span tables
AND new insights/sample queries section from the enrichment work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:22:20 +01:00
Pratik Mankawde
1b227a1eff docs(telemetry): update runbook with enriched attributes and sample queries
Adds comprehensive "Insights and Sample Queries" section showing operators
what questions they can answer with the newly-added span attributes:
- Transaction workflow analysis (filter by tx_type, fee, ter_result)
- TxQ health (txq_status, ledger_changed)
- RPC debugging (is_batch, request_payload_size, load_type)
- PathFinding performance (dest_currency, num_source_assets)
- Consensus health (consensus_state, is_bow_out, disputes_count)
- Cross-subsystem correlation examples

Also updates all span reference tables with the new attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:18:43 +01:00
Pratik Mankawde
b0e9e1a24d Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 16:16:53 +01:00
Pratik Mankawde
1162b6f3bc Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-01 16:00:14 +01:00
Pratik Mankawde
0bcc7635ac Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-01 16:00:00 +01:00
Pratik Mankawde
f51b113f4b Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-01 14:46:22 +01:00
Pratik Mankawde
7cf55315b5 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-01 14:45:57 +01:00
Pratik Mankawde
ce6a3153a1 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-01 11:49:43 +01:00
Pratik Mankawde
3115313551 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-01 11:49:30 +01:00
Pratik Mankawde
280217653d compilation fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:38:58 +01:00
Pratik Mankawde
d7579b2861 formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:21:00 +01:00
Pratik Mankawde
8d730b8b9a Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
e5fae351d6 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-29 17:53:29 +01:00
Pratik Mankawde
a844c14e49 merge: pratik/otel-phase5-docs-deployment (line-number + docs cleanup) into pratik/otel-phase6-statsd 2026-05-14 17:00:05 +01:00
Pratik Mankawde
c3c980e858 merge: pratik/otel-phase4-consensus-tracing (line-number + docs cleanup) into pratik/otel-phase5-docs-deployment 2026-05-14 17:00:02 +01:00
Pratik Mankawde
1a36ef4b0f fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter
Follow-up to the phase-6 dashboard cleanup. The three dashboards
introduced by commit f6105ece98 (consensus-health, rpc-performance,
transaction-overview) were missed in the initial UID rename and still
carried `rippled-*` UIDs plus line-number refs in panel descriptions.

- UIDs: `rippled-consensus` -> `xrpld-consensus`,
  `rippled-rpc-perf` -> `xrpld-rpc-perf`,
  `rippled-transactions` -> `xrpld-transactions`, matching the
  post-`docs.sh`-rename runbook and the other dashboards in this PR.
- Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`,
  `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers
  drift on every refactor; the filename is enough to grep.
- Fix the Overall RPC Throughput panel: two targets filtered on
  `span_name="rpc.request"` (never emitted) instead of
  `span_name="rpc.http_request"` (the real emitted name). The panel
  would have shown zero data until this fix.
2026-05-14 16:58:47 +01:00
Pratik Mankawde
a789f6ccf5 docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md
Follow-up to the dashboard cleanup on this branch. Caught additional sites
in TESTING.md that still reference the never-emitted `rpc.request` span:

- TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on
  `name="rpc.http_request"` (the real emitted name).
- Expected-spans table replaces `rpc.request` with `rpc.http_request`.
- Query loop under the Prometheus verification section now iterates over
  the full set of emitted RPC entry-point names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`).

Also drop `exporter=otlp_http` from the sample telemetry config block.
`TelemetryConfig.cpp` does not parse an `exporter` key in any phase through
Phase 8; only OTLP/HTTP is wired up, so the line is either a silently
ignored no-op or misleading documentation.
2026-05-14 16:53:40 +01:00
Pratik Mankawde
44cdc8133e fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers
Phase-6 introduces ledger-operations, peer-network, and the five StatsD
dashboards. Align them with the rest of the chain:

- Rename dashboard UIDs from `rippled-*` to `xrpld-*` so the provisioned
  UIDs match the post-rename-script documentation (`docs.sh` rewrites
  .md but not .json, so the two drifted). The runbook references
  `xrpld-rpc-perf`, `xrpld-transactions`, etc.; the JSON now matches.
- Add the `$node` template variable + `exported_instance=~"$node"` filter
  to every target in the five `statsd-*` dashboards. Mirrors the pattern
  already used by consensus-health, ledger-operations, and peer-network
  per the project rule that every dashboard must support per-node
  filtering.
- Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references
  in every dashboard panel description and in docker/telemetry/TESTING.md.
  Line numbers drift on every refactor; the filename alone is enough to
  grep.
- Replace stale `rpc.request` entries with the real emitted span names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
  in TESTING.md so operators can copy-paste the filters and hit real
  traces.
- Also drop the `:706` line ref from the `StatsDCollector.cpp` callout
  in `06-implementation-phases.md`.
2026-05-14 16:51:14 +01:00
Pratik Mankawde
dfe91e071f merge: phase-5 (runbook span-name + line-number fixes) into phase-6
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	docs/telemetry-runbook.md
2026-05-14 16:42:13 +01:00
Pratik Mankawde
dec8b0a9a1 docs(telemetry): fix stale RPC span names + drop volatile line numbers in runbook
- RPC Spans table: `rpc.request` was documented but the code actually emits
  `rpc.http_request`. Listed the actual emitted names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
  and their parent/child relationship.
- Drop `:<line>` suffixes from Source File columns in both RPC and
  Transaction span tables. Line numbers drift with every refactor; the
  filename is enough for operators to grep.
- Summary table: replace the never-emitted `rpc.request` row with the real
  entry points so `span_name=` filters in PromQL / TraceQL match.
2026-05-14 16:34:58 +01:00
Pratik Mankawde
df1d8aed44 merge: phase-4 (phase-1a docs fixes) into phase-5 2026-05-14 16:24:36 +01:00
Pratik Mankawde
56090b0ead merge: pratik/otel-phase5-docs-deployment fix(SpanKind) into pratik/otel-phase6-statsd 2026-05-14 15:55:03 +01:00
Pratik Mankawde
6c6d6f953f merge: pratik/otel-phase4-consensus-tracing fix(SpanKind) into pratik/otel-phase5-docs-deployment 2026-05-14 15:55:01 +01:00
Pratik Mankawde
b449db0434 fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names
Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 14:01:12 +01:00
Pratik Mankawde
9babfff3c8 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-14 13:59:19 +01:00
Pratik Mankawde
68b32ed0f0 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-14 13:59:14 +01:00
Pratik Mankawde
fe7cb33b65 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:53:47 +01:00
Pratik Mankawde
f5cf4155c2 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:53:45 +01:00
Pratik Mankawde
b05e650b6f docs(telemetry): update 09-data-collection-reference + Phase5 integration test list for simplified attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:42:30 +01:00
Pratik Mankawde
57175ab12c Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:37:37 +01:00
Pratik Mankawde
d44a0aa3ff docs(telemetry): update Phase5 task list for simplified attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:37:27 +01:00
Pratik Mankawde
522fe562ff Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:36:34 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
e60efd4d2f Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:10:46 +01:00
Pratik Mankawde
c48f5ed6e7 docs(telemetry): update runbook attr names for simplified naming convention
Update 31 attribute references in telemetry-runbook.md to match the
simplified naming: drop xrpl.<domain>. prefix on per-span attrs, use
domain-qualified names for collisions (rpc_status, consensus_state,
etc.), and unify cross-domain refs (xrpl.ledger.seq, xrpl.tx.hash).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:08:48 +01:00
Pratik Mankawde
c9fe4b1a14 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:04:27 +01:00
Pratik Mankawde
580ee5ede7 fix(telemetry): StatsD gauge and io_latency first-sample emit
Two fixes so gauges register in Prometheus (via StatsD) even when their
initial/steady-state value is 0:

1. StatsDGaugeImpl m_dirty: default-init to true so the initial value
   (0) is emitted on the first flush. Previously, gauges whose value
   never changed from 0 were never flushed and never appeared
   downstream.

2. io_latency_sampler firstSample_: new atomic<bool>, init true.
   m_event.notify now fires when either firstSample_ is true (exchanged
   to false) or lastSample >= 10 ms. This guarantees the io_latency
   metric is registered on startup; subsequent sub-10 ms samples are
   still suppressed to avoid flooding.
2026-05-13 14:40:58 +01:00
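
The two fixes in the commit above can be sketched in isolation. This is a minimal illustration, not the project's code: `DemoGauge` and `FirstSampleGate` are hypothetical stand-ins for `StatsDGaugeImpl` and the `io_latency_sampler`, and only the dirty-flag default and the exchange-based first-sample gate are taken from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical gauge; names are illustrative, not the real StatsDGaugeImpl.
class DemoGauge
{
public:
    void
    set(std::uint64_t value)
    {
        if (value != value_)
        {
            value_ = value;
            dirty_ = true;
        }
    }

    // Returns the value to emit on this flush, or nullopt if nothing changed.
    std::optional<std::uint64_t>
    flush()
    {
        if (!dirty_)
            return std::nullopt;
        dirty_ = false;
        return value_;
    }

private:
    std::uint64_t value_ = 0;
    bool dirty_ = true;  // start dirty so the initial 0 is emitted once
};

// Hypothetical notify gate for a latency sampler.
class FirstSampleGate
{
public:
    // True for the first sample, afterwards only at or above 10 ms.
    bool
    shouldNotify(std::chrono::milliseconds lastSample)
    {
        assert(lastSample.count() >= 0);  // latencies are never negative
        // exchange() returns the previous value, so only the first call
        // sees true.
        bool const first = firstSample_.exchange(false);
        return first || lastSample >= std::chrono::milliseconds{10};
    }

private:
    std::atomic<bool> firstSample_{true};
};
```

With these definitions, a freshly constructed `DemoGauge` emits its initial 0 on the first `flush()` and nothing on the second, and a `FirstSampleGate` lets a 0 ms first sample through but suppresses a later 3 ms sample.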
Pratik Mankawde
beaf01ae4d fix(telemetry): fix CI failures in phase-6 build, clang-tidy, and rename checks
Build fixes in PeerImp.cpp:
- Rename duplicate `span` variable to `consSpan` in proposal and
  validation handlers to avoid redefinition error
- Fix `->` on non-pointer SpanGuard (now correctly on shared_ptr)
- Fix move-only type copy in lambda capture

Clang-tidy fixes:
- Concatenate nested namespaces in LedgerSpanNames.h and PeerSpanNames.h
- Add missing SpanNames.h includes in BuildLedger.cpp, LedgerMaster.cpp,
  PeerImp.cpp for direct seg:: symbol usage
- Add missing <chrono> and <cstdint> includes in BuildLedger.cpp
- Remove unused Feature.h include from BuildLedger.cpp

Rename check fix:
- Run docs.sh to rename rippled_ metric prefixes to xrpld_ in
  09-data-collection-reference.md and telemetry-runbook.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:09:17 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
51918ef868 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 19:58:54 +01:00
Pratik Mankawde
e6266e4e8d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 18:20:23 +01:00
Pratik Mankawde
025620cc4e Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 18:20:19 +01:00
Pratik Mankawde
3dd2f34591 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-04-29 17:38:03 +01:00
Pratik Mankawde
521e0756e1 docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:33:10 +01:00
Pratik Mankawde
f434706eec Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
2026-04-29 17:16:28 +01:00
Pratik Mankawde
8a54ef1600 docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:08:53 +01:00
Pratik Mankawde
612a32d047 feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:07:10 +01:00
Pratik Mankawde
7be06aaae0 fix(telemetry): address code review findings for Phase 4 consensus tracing
- Fix quorum attribute to use actual validator quorum instead of proposer
  count
- Add missing ConsensusState::Expired handling in haveConsensus() span
- Move ConsensusSpanNames.h to xrpld/consensus/ to resolve levelization
  cycle
- Remove unused constants
- Enrich proposal receive span with sequence
- Correct stale documentation references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:07:10 +01:00
Pratik Mankawde
70aa2b66dd fix: address PR review round 2 — event name constants, span timing
- Add cons_span::event namespace with disputeResolve and txIncluded
  constants; replace hardcoded strings in Consensus.h and RCLConsensus.cpp
- Move proposal.receive and validation.receive spans in PeerImp into
  shared_ptr captured by job lambdas so they measure checkPropose and
  checkValidation timing, not just message parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:07:09 +01:00
Pratik Mankawde
887b35821d code review changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:06:22 +01:00
Pratik Mankawde
faf9342695 docs(telemetry): mark Phase 4/4a consensus tracing tasks complete
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:06:22 +01:00
Pratik Mankawde
0a371dca7d feat(telemetry): complete Phase 4 consensus tracing
Implement remaining Phase 4/4a consensus tracing tasks:

- Add consensus.phase.open span (open → closeLedger lifecycle)
- Add consensus.proposal.receive span in PeerImp with trusted attr
- Add consensus.validation.receive span in PeerImp with trusted/seq attrs
- Add tx_count attr on accept.apply, disputes_count on update_positions
- Add tx.included events with txId in doAccept transaction loop
- Enhance dispute.resolve event with yays/nays fields
- Add avalanche_threshold attr on update_positions span
- Reparent accept/accept.apply as children of round span via childSpan()

Also adds compile-time constants in ConsensusSpanNames.h and updates
the span hierarchy diagram.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:06:22 +01:00
Pratik Mankawde
6c904a5593 docs update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
75191e472b fix(telemetry): remove duplicate hashSpan(4-arg) from rebase
The 4-arg hashSpan overload was duplicated during a prior rebase
cascade — it appeared at both line 240 and line 305 in SpanGuard.cpp.
This would cause a linker error (multiple definition).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
1e6d55bbce docs(telemetry): document hashSpan factory, ConsensusSpanNames.h, and API details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
86ef6ff2cf feat(telemetry): add avalanche threshold and close time consensus attributes
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:

- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
  threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
  reached close time consensus in this iteration

These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
6157624103 fix(telemetry): preserve deterministic trace_id in round spans
Remove the span-replacement logic in startRoundTracing() that was
discarding the hash-derived round span and replacing it with a linked
span (which gets a random trace_id). The deterministic trace_id from
the ledger hash is the key feature enabling cross-node correlation —
replacing it broke correlation on all rounds after the first.

Also: use thread_local mt19937 for hashSpan() span IDs (same fix as
phase-3 txSpan), add Doxygen comments to the establish-phase tracing
method declarations in Consensus.h, and update the SpanGuard.h diagram
with hashSpan/addEvent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
54c97daaf1 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
654fe2d30f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
0012f52940 fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
46af5bdc5a fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
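
The lifetime change described in the commit above (hold the span in a shared_ptr and capture it in the job lambda) can be illustrated with a small, self-contained sketch. `DemoSpan`, `DemoJobQueue`, and `handleMessage` are hypothetical names invented for this example; the real SpanGuard and job queue APIs are not shown in this log.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical RAII span: appends its name to a log when it ends.
class DemoSpan
{
public:
    DemoSpan(std::string name, std::vector<std::string>& ended)
        : name_(std::move(name)), ended_(ended)
    {
    }
    DemoSpan(DemoSpan const&) = delete;
    DemoSpan&
    operator=(DemoSpan const&) = delete;
    ~DemoSpan()
    {
        ended_.push_back(name_);
    }

private:
    std::string name_;
    std::vector<std::string>& ended_;
};

// Hypothetical job queue: stores jobs and runs them later.
class DemoJobQueue
{
public:
    void
    post(std::function<void()> job)
    {
        assert(job);  // an empty std::function cannot be run
        jobs_.push_back(std::move(job));
    }

    void
    runAll()
    {
        // Move the jobs out so each closure, and the span it owns, is
        // destroyed when this function returns.
        auto jobs = std::move(jobs_);
        jobs_.clear();
        for (auto& job : jobs)
            job();
    }

private:
    std::vector<std::function<void()>> jobs_;
};

// The span is created while the message is parsed, but the job holds a
// reference, so the span ends only after the deferred work has run.
inline void
handleMessage(DemoJobQueue& queue, std::vector<std::string>& ended)
{
    auto span = std::make_shared<DemoSpan>("tx.receive", ended);
    queue.post([span] { /* deferred processing would happen here */ });
}
```

After `handleMessage()` returns, the log of ended spans is still empty; the `tx.receive` entry appears only once `runAll()` has executed and released the job.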
Pratik Mankawde
a05ada89ec refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
8afe604aff fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:36 +01:00
Pratik Mankawde
417d7ec6d5 docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:36 +01:00
Pratik Mankawde
2918001602 fix(telemetry): use default_prng() for span IDs, fix non-telemetry build
Replace thread_local mt19937 with xrpl::default_prng() for span ID
generation — uses the project's existing thread-local xor-shift engine.
One call yields a uint64_t (8 bytes), filling the span ID in a single
memcpy without loops.

Fix compilation failure when XRPL_ENABLE_TELEMETRY is not defined:
move xrpl.pb.h include outside the #ifdef guard in TxTracing.h since
protocol::TMTransaction is used unconditionally in the function
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:36 +01:00
Pratik Mankawde
6a8053df2d fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
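
A rough sketch of the per-thread generator idea from the commit above. It is an assumption-laden illustration: `randomSpanId` and `SpanId` are hypothetical names, and it uses `std::mt19937_64` rather than the `std::mt19937` named in the commit so that one draw fills all 8 bytes. The zero check reflects the W3C Trace Context rule that an all-zero span ID is invalid.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <random>
#include <tuple>

using SpanId = std::array<std::uint8_t, 8>;

// Hypothetical helper: returns a random, non-zero 8-byte span ID.
inline SpanId
randomSpanId()
{
    // Seeded once per thread from random_device, then reused on every call.
    thread_local std::mt19937_64 engine{std::random_device{}()};

    std::uint64_t value = engine();
    while (value == 0)  // an all-zero span ID is invalid, so redraw
        value = engine();

    static_assert(
        sizeof(std::uint64_t) == std::tuple_size<SpanId>::value,
        "span ID must be 8 bytes");
    SpanId id{};
    std::memcpy(id.data(), &value, sizeof(value));
    assert(id != SpanId{});  // postcondition: never all-zero
    return id;
}
```

The `random_device` is touched only when a thread first calls the function; later calls on that thread reuse the already-seeded engine.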
Pratik Mankawde
e2a7802945 refactor(telemetry): colocate SpanNames headers with their classes
Move TxSpanNames.h and TxQSpanNames.h from src/xrpld/telemetry/ to sit
next to the classes they instrument, matching the PathFindSpanNames.h
convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
7f0a8a7ed7 feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
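
The `trace_id = txHash[0:16]` rule from the commit above amounts to copying the first 16 bytes of the hash. A minimal sketch, assuming a raw byte pointer plus size as input; `traceIdFromHash` and `TraceId` are hypothetical names and not the SpanGuard API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using TraceId = std::array<std::uint8_t, 16>;

// Hypothetical helper: derive a trace ID from hash[0:16]. Every node that
// sees the same hash computes the same trace ID without coordination.
inline TraceId
traceIdFromHash(std::uint8_t const* hash, std::size_t size)
{
    assert(hash != nullptr && size >= 16);
    TraceId id{};
    std::copy_n(hash, id.size(), id.begin());
    return id;
}
```

Because only the first 16 bytes are used, two hashes that differ only in their later bytes map to the same trace ID. For a cryptographic transaction hash that is vanishingly unlikely, but it is worth keeping in mind if the helper is reused with other inputs.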
Pratik Mankawde
39f690a751 docs(telemetry): add Task 3.10 TxQ instrumentation to Phase 3 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
d6ee6c6bbc feat(telemetry): add TxQ tracing with 6 spans (Tasks 3.9/3.10)
Instrument the transaction queue lifecycle with full span coverage:

- txq.enqueue: wraps TxQ::apply() enqueue/direct/reject decision
  with tx_hash attribute
- txq.apply_direct: wraps TxQ::tryDirectApply() fast-path
- txq.batch_clear: wraps TxQ::tryClearAccountQueueUpThruTx()
  batch clear on high-fee tx
- txq.accept: wraps TxQ::accept() ledger-close dequeue cycle
  with queue_size attribute
- txq.accept_tx: per-tx span inside accept loop with tx_hash,
  ter_code, retries_remaining attributes
- txq.cleanup: wraps TxQ::processClosedLedger() fee metric updates
  and tx expiration with ledger_seq attribute

New file: TxQSpanNames.h with compile-time constants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
4d3d15eda8 docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
8bed4bc95a refactor(telemetry): extract TX span name constants into TxSpanNames.h
Move scattered string literals from PeerImp.cpp and NetworkOPs.cpp into
compile-time constants in src/xrpld/telemetry/TxSpanNames.h. Follows
the same StaticStr/join() pattern established in Phase 1c for RPC spans.

Constants cover: span prefixes (tx), operations (receive, process),
attribute keys (hash, local, path, suppressed, status, peerId,
peerVersion), and values (sync, async, knownBad).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
92072d0304 docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
94005ca0e4 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
780cc434a7 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
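
The inject/extract round trip mentioned in the commit above can be sketched without protobuf. Everything here is hypothetical: `DemoTraceContextMsg` stands in for the generated TraceContext message (its real field names are not shown in this log), and `injectDemo` / `extractDemo` only illustrate the shape of such helpers, including rejecting malformed IDs on the receive side.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Hypothetical stand-in for a protobuf message with two `bytes` fields
// and a flags field.
struct DemoTraceContextMsg
{
    std::string traceId;
    std::string spanId;
    std::uint32_t flags = 0;
};

// Hypothetical in-process span context.
struct DemoSpanContext
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
    std::uint8_t flags = 0;
};

// Send side: copy the raw IDs into the message.
inline void
injectDemo(DemoSpanContext const& ctx, DemoTraceContextMsg& msg)
{
    msg.traceId.assign(
        reinterpret_cast<char const*>(ctx.traceId.data()), ctx.traceId.size());
    msg.spanId.assign(
        reinterpret_cast<char const*>(ctx.spanId.data()), ctx.spanId.size());
    msg.flags = ctx.flags;
}

// Receive side: return a context only if both IDs have the expected length.
inline std::optional<DemoSpanContext>
extractDemo(DemoTraceContextMsg const& msg)
{
    DemoSpanContext ctx;
    if (msg.traceId.size() != ctx.traceId.size() ||
        msg.spanId.size() != ctx.spanId.size())
        return std::nullopt;
    std::memcpy(ctx.traceId.data(), msg.traceId.data(), ctx.traceId.size());
    std::memcpy(ctx.spanId.data(), msg.spanId.data(), ctx.spanId.size());
    ctx.flags = static_cast<std::uint8_t>(msg.flags & 0xFF);
    return ctx;
}
```

Returning `std::optional` lets the receiver fall back to starting a fresh span when a peer sends no context or a malformed one, instead of trusting arbitrary bytes from the network.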
Pratik Mankawde
39273e3aae Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docs/telemetry-runbook.md
2026-04-29 14:30:13 +01:00
Pratik Mankawde
9f571e5d1e docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:28:40 +01:00
Pratik Mankawde
dc3cfc325c Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 14:24:56 +01:00
Pratik Mankawde
ac11217195 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	include/xrpl/telemetry/TraceContextPropagator.h
2026-04-29 14:24:38 +01:00
Pratik Mankawde
103dd605d2 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
# Conflicts:
#	include/xrpl/telemetry/SpanGuard.h
#	src/xrpld/overlay/detail/PeerImp.cpp
2026-04-29 14:23:31 +01:00
Pratik Mankawde
12b7316f71 feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:21:32 +01:00
Pratik Mankawde
b933e8ae00 feat(telemetry): add missing StatsD dashboard panels from production dashboard
Compared shared production Grafana dashboard against Phase 6 StatsD
dashboards and added 10 missing panels covering job execution/dequeue
timers, cache metrics, ledger publish gap, state duration rate, duplicate
traffic, and detailed traffic breakdown.

Node Health dashboard: 8 → 16 panels, plus quantile template variable.
Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate().
Updated runbook, data collection reference, and implementation phases docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:02:27 +01:00
Pratik Mankawde
a1cb752745 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 13:01:38 +01:00
Pratik Mankawde
fb04271204 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 13:01:31 +01:00
Pratik Mankawde
35fb33438f Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 13:01:24 +01:00
Pratik Mankawde
36c4363c54 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 13:01:19 +01:00
Pratik Mankawde
0dec657c61 fix(telemetry): rename dashboard provider to xrpld, replace Jaeger with Tempo troubleshooting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:00:40 +01:00
Pratik Mankawde
b7c9e5775e feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:00:39 +01:00
Pratik Mankawde
2aa8dbc2cb fix(telemetry): restore StatsD receiver, fix metric prefix and doc errors
The StatsD receiver config was lost during a branch rebase (--ours
conflict resolution dropped it). Re-add the statsd receiver to the
OTel Collector config and wire it into the metrics pipeline so
beast::insight UDP metrics flow to Prometheus.

Also fixes:
- Metric prefix mismatch: docs used xrpld_ but dashboards/tests use
  rippled_ — align all documentation to match the runnable stack
- Remove phantom Peer_Disconnects_Charges from docs (plain atomic,
  not a beast::insight gauge)
- Remove premature .codecov.yml exclusions for Phase 7 OTelCollector
  files that don't exist on this branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:57:50 +01:00
Pratik Mankawde
8daf09b3ce Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-04-29 12:37:06 +01:00
Pratik Mankawde
a3044bcef9 fix(telemetry): address review findings for docs/dashboards
- Add missing xrpl.consensus.quorum attribute to consensus.accept in runbook
- Fix dashboard legend formats: add exported_instance, use Title Case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:36:24 +01:00
Pratik Mankawde
3433c9583d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/otel-collector-config.yaml
#	docs/telemetry-runbook.md
2026-04-29 12:34:27 +01:00
Pratik Mankawde
a271744d42 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 12:31:07 +01:00
Pratik Mankawde
09c5f5c3bf Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 12:31:03 +01:00
Pratik Mankawde
b8d3c52017 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 12:30:59 +01:00
Pratik Mankawde
21dad9a17d docs(telemetry): sync runbook, dashboards, and configs with code
- Add 14 missing spans to runbook (6 TxQ + 8 consensus)
- Fix tx.receive attributes and config table in runbook
- Document dispute.resolve and tx.included span events
- Add spanmetrics dimensions for close_time_correct and tx.suppressed
- Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL
- Wire $consensus_mode template variable to all consensus panels
- Add 10 Tempo search filters for operational attributes
- Apply rename script artifacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:29:53 +01:00
Pratik Mankawde
1a96f75954 fix(telemetry): apply rename script to phase 6 documentation
Replace remaining rippled/Ripple references with xrpld/XRPL in
data collection reference, implementation phases, and runbook docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:30:50 +01:00
Pratik Mankawde
88e25119f0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 11:29:14 +01:00
Pratik Mankawde
c5a59645d9 fix(telemetry): resolve merge conflicts, bashate, and rename for phase 5
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:28:54 +01:00
Pratik Mankawde
c0a5f57cdf Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 11:24:05 +01:00
Pratik Mankawde
8e97c7329a fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:23:43 +01:00
Pratik Mankawde
fe058d49b4 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 11:21:35 +01:00
Pratik Mankawde
c01f8ae99c fix(telemetry): address code review findings for Phase 4 consensus tracing
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 18:14:00 +01:00
Pratik Mankawde
fb25d97077 fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 18:01:50 +01:00
Pratik Mankawde
d50e0ff48e fix: address PR review round 2 — event name constants, span timing
- Add cons_span::event namespace with disputeResolve and txIncluded
  constants; replace hardcoded strings in Consensus.h and RCLConsensus.cpp
- Move proposal.receive and validation.receive spans in PeerImp into
  shared_ptr captured by job lambdas so they measure checkPropose and
  checkValidation timing, not just message parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 17:58:06 +01:00
Pratik Mankawde
d990f7f197 code review changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-28 17:03:49 +01:00
Pratik Mankawde
1e4ce19556 docs(telemetry): mark Phase 4/4a consensus tracing tasks complete
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 16:17:06 +01:00
Pratik Mankawde
bc49eb6f83 feat(telemetry): complete Phase 4 consensus tracing
Implement remaining Phase 4/4a consensus tracing tasks:

- Add consensus.phase.open span (open → closeLedger lifecycle)
- Add consensus.proposal.receive span in PeerImp with trusted attr
- Add consensus.validation.receive span in PeerImp with trusted/seq attrs
- Add tx_count attr on accept.apply, disputes_count on update_positions
- Add tx.included events with txId in doAccept transaction loop
- Enhance dispute.resolve event with yays/nays fields
- Add avalanche_threshold attr on update_positions span
- Reparent accept/accept.apply as children of round span via childSpan()

Also adds compile-time constants in ConsensusSpanNames.h and updates
the span hierarchy diagram.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 16:16:53 +01:00
Pratik Mankawde
90c2321bb8 docs update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-28 15:33:45 +01:00
Pratik Mankawde
901b3e34f6 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-28 15:08:11 +01:00
Pratik Mankawde
908eb841bd Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-28 15:08:06 +01:00
Pratik Mankawde
128de625e2 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-28 15:08:01 +01:00
Pratik Mankawde
ebd84a2338 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
# Conflicts:
#	src/libxrpl/telemetry/SpanGuard.cpp
2026-04-28 15:07:54 +01:00
Pratik Mankawde
cb7ee2358d docs(telemetry): update data collection reference with complete span/attribute inventory
Update 09-data-collection-reference.md to reflect the full
implementation across all phases:

- Expand span inventory from 16 to 35 spans across 8 categories
  (RPC, PathFind, TX, TxQ, Consensus, Ledger, Peer, gRPC)
- Add complete attribute inventory (81 attributes)
- Add TxQ spans (6), PathFind spans (5), and all 10 consensus spans
- Document LedgerSpanNames.h and PeerSpanNames.h in file inventory
- Add close time analysis dashboard panels to dashboard reference
- Add $close_time_correct and $resolution_direction template variables
- Document toDisplayString(ConsensusMode) utility
- Fix section numbering (duplicate section 8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
b54b17708f feat(telemetry): add close time analysis panels to consensus-health dashboard
Add 5 new panels to the consensus-health Grafana dashboard using Tempo
TraceQL queries against consensus.accept.apply span attributes:

- Close Time: Raw Proposals (Per Node) — each node's unrounded
  wall-clock close_time_self, reveals clock drift across validators
- Close Time: Effective / Quantized — the consensus-agreed close_time
  after rounding to resolution bins, written to ledger header
- Close Time Vote Bins & Resolution — number of distinct vote bins
  (close_time_vote_bins) and bin size (close_resolution_ms) on dual axes
- Close Time Resolution Direction — whether resolution increased
  (coarser), decreased (finer), or stayed unchanged
- Close Time Bin Distribution — bar chart showing how raw proposals
  distribute across quantized bins per round

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
cbbd6ebee2 feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.

Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
  attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards

Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
  overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
de7194011d fix(docs): apply rename scripts to telemetry deployment docs
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in TESTING.md, xrpld-telemetry.cfg, and
telemetry-runbook.md, fixing the check-rename CI failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
ae475793d5 docs(telemetry): mark Phase 5 deferred tasks and fix stale macro reference
Mark Tasks 5.3 (alert definitions) and 5.6 (training materials) as
"Deferred — post-MVP" in the implementation phases document to
accurately reflect current delivery scope. Add status column to the
Phase 5 task table.

Also fix stale reference to XRPL_TRACE_* macros in Phase 4a section —
the implementation uses SpanGuard factory methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
f6105ece98 feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.

- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
360698d79d fix(telemetry): remove duplicate hashSpan(4-arg) from rebase
The 4-arg hashSpan overload was duplicated during a prior rebase
cascade — it appeared at both line 240 and line 305 in SpanGuard.cpp.
This would cause a linker error (multiple definition).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:26 +01:00
Pratik Mankawde
b136b80c13 docs(telemetry): document hashSpan factory, ConsensusSpanNames.h, and API details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
7e47c6303f feat(telemetry): add avalanche threshold and close time consensus attributes
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:

- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
  threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
  reached close time consensus in this iteration

These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
689e803cc7 fix(telemetry): preserve deterministic trace_id in round spans
Remove the span-replacement logic in startRoundTracing() that was
discarding the hash-derived round span and replacing it with a linked
span (which gets a random trace_id). The deterministic trace_id from
the ledger hash is the key feature enabling cross-node correlation —
replacing it broke correlation on all rounds after the first.

Also: use thread_local mt19937 for hashSpan() span IDs (same fix as
phase-3 txSpan), add Doxygen to establish tracing method declarations
in Consensus.h, and update SpanGuard.h diagram with hashSpan/addEvent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
34ee231d62 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
4f4b4dd199 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:30:31 +01:00
Pratik Mankawde
d87839230a fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
e2cb811bf7 docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
2bb0995ff8 fix(telemetry): use default_prng() for span IDs, fix non-telemetry build
Replace thread_local mt19937 with xrpl::default_prng() for span ID
generation — uses the project's existing thread-local xor-shift engine.
One call yields a uint64_t (8 bytes), filling the span ID in a single
memcpy without loops.

Fix compilation failure when XRPL_ENABLE_TELEMETRY is not defined:
move xrpl.pb.h include outside the #ifdef guard in TxTracing.h since
protocol::TMTransaction is used unconditionally in the function
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
793fe65a96 fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
737b0f5488 refactor(telemetry): colocate SpanNames headers with their classes
Move TxSpanNames.h and TxQSpanNames.h from src/xrpld/telemetry/ to sit
next to the classes they instrument, matching the PathFindSpanNames.h
convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
ded848075d feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
397c66cede docs(telemetry): add Task 3.10 TxQ instrumentation to Phase 3 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
Pratik Mankawde
2fb165cd54 feat(telemetry): add TxQ tracing with 6 spans (Tasks 3.9/3.10)
Instrument the transaction queue lifecycle with full span coverage:

- txq.enqueue: wraps TxQ::apply() enqueue/direct/reject decision
  with tx_hash attribute
- txq.apply_direct: wraps TxQ::tryDirectApply() fast-path
- txq.batch_clear: wraps TxQ::tryClearAccountQueueUpThruTx()
  batch clear on high-fee tx
- txq.accept: wraps TxQ::accept() ledger-close dequeue cycle
  with queue_size attribute
- txq.accept_tx: per-tx span inside accept loop with tx_hash,
  ter_code, retries_remaining attributes
- txq.cleanup: wraps TxQ::processClosedLedger() fee metric updates
  and tx expiration with ledger_seq attribute

New file: TxQSpanNames.h with compile-time constants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
Pratik Mankawde
c585d9b66c docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
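
The `trace_id = txHash[0:16]` strategy described in the commit above can be sketched as follows. This is a minimal illustration, not the code from the branch: the `TxHash` and `TraceId` aliases and the `traceIdFromTxHash` helper are hypothetical names, and the sketch assumes a 32-byte transaction hash and a 16-byte trace ID.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Assumed sizes: 256-bit transaction hash, 128-bit trace ID.
using TxHash = std::array<std::uint8_t, 32>;
using TraceId = std::array<std::uint8_t, 16>;

// Hypothetical helper: take the first 16 bytes of the hash as the trace ID,
// so every node that sees the same transaction derives the same trace ID
// without exchanging any trace context.
TraceId
traceIdFromTxHash(TxHash const& hash)
{
    TraceId id{};
    std::copy_n(hash.begin(), id.size(), id.begin());
    return id;
}
```

Because only the hash prefix is used, two nodes agree on the trace ID independently; the span ID still has to come from elsewhere (random, or propagated over protobuf) to order spans within the trace.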
Pratik Mankawde
79ed703bb2 refactor(telemetry): extract TX span name constants into TxSpanNames.h
Move scattered string literals from PeerImp.cpp and NetworkOPs.cpp into
compile-time constants in src/xrpld/telemetry/TxSpanNames.h. Follows
the same StaticStr/join() pattern established in Phase 1c for RPC spans.

Constants cover: span prefixes (tx), operations (receive, process),
attribute keys (hash, local, path, suppressed, status, peerId,
peerVersion), and values (sync, async, knownBad).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
441c88dfb1 docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
178bc916a8 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
19eead6955 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
41 changed files with 9095 additions and 101 deletions

View File

@@ -36,3 +36,6 @@ ignore:
- "src/tests/"
- "include/xrpl/beast/test/"
- "include/xrpl/beast/unit_test/"
# Telemetry modules — conditionally compiled behind XRPL_ENABLE_TELEMETRY,
# which is not enabled in coverage builds.
- "src/xrpld/telemetry/"

View File

@@ -1,11 +1,11 @@
## Renaming ripple(d) to xrpl(d)
In the initial phases of development of the XRPL, the open source codebase was
called "rippled" and it remains with that name even today. Today, over 1000
called "xrpld" and it remains with that name even today. Today, over 1000
nodes run the application, and code contributions have been submitted by
developers located around the world. The XRPL community is larger than ever.
In light of the decentralized and diversified nature of XRPL, we will rename any
references to `ripple` and `rippled` to `xrpl` and `xrpld`, when appropriate.
references to `ripple` and `xrpld` to `xrpl` and `xrpld`, when appropriate.
See [here](https://xls.xrpl.org/xls/XLS-0095-rename-rippled-to-xrpld.html) for
more information.
@@ -22,17 +22,17 @@ run from the repository root.
2. `.github/scripts/rename/copyright.sh`: This script will remove superfluous
copyright notices.
3. `.github/scripts/rename/cmake.sh`: This script will rename all CMake files
from `RippleXXX.cmake` or `RippledXXX.cmake` to `XrplXXX.cmake`, and any
references to `ripple` and `rippled` (with or without capital letters) to
from `RippleXXX.cmake` or `XrpldXXX.cmake` to `XrplXXX.cmake`, and any
references to `ripple` and `xrpld` (with or without capital letters) to
`xrpl` and `xrpld`, respectively. The name of the binary will remain as-is,
and will only be renamed to `xrpld` by a later script.
4. `.github/scripts/rename/binary.sh`: This script will rename the binary from
`rippled` to `xrpld`, and reverses the symlink so that `rippled` points to
`xrpld` to `xrpld`, and reverses the symlink so that `xrpld` points to
the `xrpld` binary.
5. `.github/scripts/rename/namespace.sh`: This script will rename the C++
namespaces from `ripple` to `xrpl`.
6. `.github/scripts/rename/config.sh`: This script will rename the config from
`rippled.cfg` to `xrpld.cfg`, and updating the code accordingly. The old
`xrpld.cfg` to `xrpld.cfg`, and updating the code accordingly. The old
filename will still be accepted.
7. `.github/scripts/rename/docs.sh`: This script will rename any lingering
references of `ripple(d)` to `xrpl(d)` in code, comments, and documentation.

View File

@@ -294,19 +294,90 @@ See [Phase4_taskList.md § Phase 4b](./Phase4_taskList.md) for full design.
### Tasks
| Task | Description |
| ---- | ----------------------------- |
| 5.1 | Operator runbook |
| 5.2 | Grafana dashboards |
| 5.3 | Alert definitions |
| 5.4 | Collector deployment examples |
| 5.5 | Developer documentation |
| 5.6 | Training materials |
| 5.7 | Final integration testing |
| Task | Description | Status |
| ---- | ----------------------------- | ------------------- |
| 5.1 | Operator runbook | Complete |
| 5.2 | Grafana dashboards | Complete |
| 5.3 | Alert definitions | Deferred post-MVP |
| 5.4 | Collector deployment examples | Complete |
| 5.5 | Developer documentation | Complete |
| 5.6 | Training materials | Deferred post-MVP |
| 5.7 | Final integration testing | Complete |
---
## 6.7 Risk Assessment
## 6.7 Phase 6: StatsD Metrics Integration (Week 10)
**Objective**: Bridge xrpld's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.
### Background
xrpld has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic: data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
### Metric Inventory
| Category | Group | Type | Count | Key Metrics |
| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
| Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode |
| Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` |
| Ledger Fetch | | Counter | 1 | `ledger_fetches` |
| Ledger History | `ledger.history` | Counter | 1 | `mismatch` |
| RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) |
| Job Queue | | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` |
| Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` |
| Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` |
| Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
| Pathfinding | | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) |
| I/O | | Event | 1 | `ios_latency` (histogram) |
| Resource Mgr | | Meter | 2 | `warn`, `drop` (rate counters) |
| Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` |
**Total**: ~255+ unique metrics (plus dynamic job-type and cache metrics)
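The total above can be checked with a short tally. This is only a sketch of the arithmetic: the fixed counts are copied from the Metric Inventory table, while `total_metrics`, `job_types`, and `caches` are illustrative names that do not appear in xrpld.

```python
# Fixed counts copied from the Metric Inventory table.
# Job Queue adds 1 fixed metric plus 2 per job type; caches add 2 per cache.
FIXED_COUNTS = {
    "Node State": 10,
    "Ledger": 2,
    "Ledger Fetch": 1,
    "Ledger History": 1,
    "RPC": 3,
    "Job Queue (job_count)": 1,
    "Peer Finder": 2,
    "Overlay": 1,
    "Overlay Traffic": 4 * 57,
    "Pathfinding": 2,
    "I/O": 1,
    "Resource Mgr": 2,
}


def total_metrics(job_types: int = 0, caches: int = 0) -> int:
    """Return the metric count for a given number of job types and caches."""
    return sum(FIXED_COUNTS.values()) + 2 * job_types + 2 * caches


print(total_metrics())  # fixed portion only: 254
```

The fixed portion comes to 254, which matches the "~255+" figure once dynamic job-type and cache metrics are added.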
### Tasks
| Task | Description |
| ---- | --------------------------------------------------------------------------------------------------------------- |
| 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp; breaking change, tracked separately |
| 6.2 | Add `statsd` receiver to OTel Collector config |
| 6.3 | Expose UDP port 8125 in docker-compose.yml |
| 6.4 | Add `[insight]` config to integration test node configs |
| 6.5 | Create "Node Health" Grafana dashboard (16 panels) |
| 6.6 | Create "Network Traffic" Grafana dashboard (10 panels) |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus |
| 6.9 | Update TESTING.md and telemetry-runbook.md |
### Wire Format Fix (Task 6.1) — DEFERRED
The `StatsDMeterImpl` in `StatsDCollector.cpp` sends metrics with a `|m` type suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).
**Status**: Deferred as a separate change, because it is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
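To illustrate why a `|m` line is dropped, here is a minimal sketch of a StatsD line parser with a type allow-list. It is not the OTel receiver's parser or the xrpld sender: the allow-list holds only the core StatsD types (counter, gauge, timer, set), and the metric name `xrpld.warn` is a hypothetical example.

```python
# Core StatsD type suffixes: counter, gauge, timer, set.
# Assumption: a real receiver may accept additional types.
STANDARD_TYPES = {"c", "g", "ms", "s"}


def parse_statsd_line(line: str):
    """Split 'name:value|type[|@rate]' into parts; return None for unknown types."""
    name, _, rest = line.partition(":")
    value, _, type_and_extras = rest.partition("|")
    metric_type = type_and_extras.split("|", 1)[0]
    if not name or not value or metric_type not in STANDARD_TYPES:
        return None
    return name, value, metric_type


def meter_to_counter(line: str) -> str:
    """Rewrite the custom '|m' suffix to '|c', leaving other lines unchanged."""
    return line[:-2] + "|c" if line.endswith("|m") else line


# Hypothetical metric name, used only for illustration.
print(parse_statsd_line("xrpld.warn:1|m"))
print(parse_statsd_line(meter_to_counter("xrpld.warn:1|m")))
```

The first call returns `None` (the line would be discarded), while the rewritten line parses as a counter.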
### New Grafana Dashboards
**Node Health** (`statsd-node-health.json`, uid: `xrpld-statsd-node-health`):
- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches, Key Jobs Execution/Dequeue Time, FullBelowCache Size/Hit Rate, Ledger Publish Gap, State Duration Rate, All Jobs Detail
**Network Traffic** (`statsd-network-traffic.json`, uid: `xrpld-statsd-network`):
- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories, Duplicate Traffic, All Traffic Categories Detail
**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `xrpld-statsd-rpc`):
- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap
### Exit Criteria
- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=xrpld_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ DEFERRED (breaking change, tracked separately)
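The first exit criterion can also be checked from a script. The sketch below builds the same query URL as the `curl` command and inspects a response body in Prometheus's instant-query JSON shape. The helper names are illustrative, and the base URL assumes the local Prometheus from the criterion above.

```python
import json
from urllib.parse import urlencode


def build_query_url(metric: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': metric})}"


def has_samples(response_body: str) -> bool:
    """Return True when the query succeeded and returned at least one series."""
    doc = json.loads(response_body)
    return doc.get("status") == "success" and len(doc.get("data", {}).get("result", [])) > 0


print(build_query_url("xrpld_LedgerMaster_Validated_Ledger_Age"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `has_samples` gives a pass/fail signal suitable for an integration test.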
---
## 6.8 Risk Assessment
```mermaid
quadrantChart
@@ -337,7 +408,7 @@ quadrantChart
---
## 6.8 Success Metrics
## 6.9 Success Metrics
| Metric | Target | Measurement |
| ------------------------ | -------------------------------------------------------------- | --------------------- |
@@ -350,13 +421,13 @@ quadrantChart
---
## 6.9 Quick Wins and Crawl-Walk-Run Strategy
## 6.10 Quick Wins and Crawl-Walk-Run Strategy
> **TxQ** = Transaction Queue
This section outlines a prioritized approach to maximize ROI with minimal initial investment.
### 6.9.1 Crawl-Walk-Run Overview
### 6.10.1 Crawl-Walk-Run Overview
<div align="center">
@@ -405,7 +476,7 @@ flowchart TB
- **RUN (Weeks 6-9)**: Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
### 6.9.2 Quick Wins (Immediate Value)
### 6.10.2 Quick Wins (Immediate Value)
| Quick Win | Value | When to Deploy |
| ------------------------------ | ------ | -------------- |
@@ -415,7 +486,7 @@ flowchart TB
| **Transaction Submit Tracing** | High | Week 3 |
| **Consensus Round Duration** | Medium | Week 6 |
### 6.9.3 CRAWL Phase (Weeks 1-2)
### 6.10.3 CRAWL Phase (Weeks 1-2)
**Goal**: Get basic tracing working with minimal code changes.
@@ -437,7 +508,7 @@ flowchart TB
- No cross-node complexity
- Single file modification to existing code
### 6.9.4 WALK Phase (Weeks 3-5)
### 6.10.4 WALK Phase (Weeks 3-5)
**Goal**: Add transaction lifecycle tracing across nodes.
@@ -458,7 +529,7 @@ flowchart TB
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues
### 6.9.5 RUN Phase (Weeks 6-9)
### 6.10.5 RUN Phase (Weeks 6-9)
**Goal**: Full observability including consensus.
@@ -481,7 +552,7 @@ flowchart TB
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
### 6.9.6 ROI Prioritization Matrix
### 6.10.6 ROI Prioritization Matrix
```mermaid
quadrantChart
@@ -503,13 +574,13 @@ quadrantChart
---
## 6.10 Definition of Done
## 6.11 Definition of Done
> **TxQ** = Transaction Queue | **HA** = High Availability
Clear, measurable criteria for each phase.
### 6.10.1 Phase 1: Core Infrastructure
### 6.11.1 Phase 1: Core Infrastructure
| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -521,7 +592,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: All criteria met, PR merged, no regressions in CI.
### 6.10.2 Phase 2: RPC Tracing
### 6.11.2 Phase 2: RPC Tracing
| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -533,7 +604,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
### 6.10.3 Phase 3: Transaction Tracing
### 6.11.3 Phase 3: Transaction Tracing
| Criterion | Measurement | Target |
| --------------------- | ------------------------------------------------- | -------------------------------------------------------- |
@@ -548,7 +619,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.
### 6.10.4 Phase 4: Consensus Tracing
### 6.11.4 Phase 4: Consensus Tracing
| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |
@@ -560,7 +631,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
### 6.10.5 Phase 5: Production Deployment
### 6.11.5 Phase 5: Production Deployment
| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |
@@ -573,7 +644,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Telemetry running in production, operators trained, alerts active.
### 6.10.6 Success Metrics Summary
### 6.11.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------- | --------------------------- | ------------- |
@@ -585,7 +656,7 @@ Clear, measurable criteria for each phase.
---
## 6.11 Recommended Implementation Order
## 6.12 Recommended Implementation Order
Based on ROI analysis, implement in this exact order:


@@ -170,31 +170,33 @@ flowchart TB
### Plan Documents
| Document | Description |
| ---------------------------------------------------------------- | -------------------------------------------------- |
| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Distributed tracing concepts and OTel primer |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | xrpld architecture and trace points |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config, CMake, Collector configs |
| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |
| [secure-OTel.md](./secure-OTel.md) | Threat model and hardening (mTLS, peer validation) |
| [presentation.md](./presentation.md) | Slide deck for OTel plan overview |
| Document | Description |
| -------------------------------------------------------------------- | -------------------------------------------------- |
| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Distributed tracing concepts and OTel primer |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | xrpld architecture and trace points |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config, CMake, Collector configs |
| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |
| [secure-OTel.md](./secure-OTel.md) | Threat model and hardening (mTLS, peer validation) |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Span/metric/dashboard inventory |
| [presentation.md](./presentation.md) | Slide deck for OTel plan overview |
### Task Lists
| Document | Description |
| ------------------------------------------ | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
| Document | Description |
| -------------------------------------------------------------------------- | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
---


@@ -0,0 +1,751 @@
# Observability Data Collection Reference
> **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by xrpld's observability stack.
>
> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
## Data Flow Overview
```mermaid
graph LR
subgraph xrpldNode["xrpld Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
B["beast::insight<br/>StatsD metrics<br/>(UDP sender)"]
end
subgraph collector["OTel Collector :4317 / :4318 / :8125"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
R2["StatsD Receiver<br/>:8125 UDP"]
BP["Batch Processor<br/>timeout 1s, batch 100"]
SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
R1 --> BP
BP --> SM
end
subgraph backends["Trace Backend"]
D["Grafana Tempo :3200<br/>TraceQL search &<br/>S3/GCS long-term storage"]
end
subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + StatsD metrics"]
end
subgraph viz["Visualization"]
F["Grafana :3000<br/>10 dashboards"]
end
A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
B -->|"UDP :8125<br/>(gauges, counters, timers)"| R2
BP -->|"OTLP/gRPC :4317"| D
SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
R2 -->|"xrpld_* gauges<br/>xrpld_* counters<br/>xrpld_* summaries"| E
E -->|"Prometheus<br/>data source"| F
D -->|"Tempo<br/>data source"| F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#d9534f,color:#fff,stroke:#b52d2d
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style R2 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```
There are two independent telemetry pipelines entering a single **OTel Collector**:
1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets to port :8125, ingested by the collector's **StatsD Receiver**, and exported alongside span-derived metrics to Prometheus.
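As a rough illustration of what "deriving RED metrics from spans" means, the sketch below aggregates a list of finished spans by name. It is a simplification under stated assumptions: the real SpanMetrics Connector emits counters and histograms rather than means, and the `Span` class and `red_metrics` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds: float):
    """Aggregate rate, error ratio, and mean duration per span name."""
    acc = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        bucket = acc[span.name]
        bucket["calls"] += 1
        bucket["errors"] += int(span.is_error)
        bucket["total_ms"] += span.duration_ms
    return {
        name: {
            "rate_per_s": b["calls"] / window_seconds,
            "error_ratio": b["errors"] / b["calls"],
            "mean_duration_ms": b["total_ms"] / b["calls"],
        }
        for name, b in acc.items()
    }


spans = [Span("rpc.process", 10.0, False), Span("rpc.process", 30.0, True)]
print(red_metrics(spans, window_seconds=2.0))
```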
**Trace backend** — The collector exports traces via OTLP/gRPC to:
- **Grafana Tempo** — Preferred trace backend. Supports TraceQL queries at `:3200`, S3/GCS object storage for cost-effective long-term trace retention, and integrates natively with Grafana.
> **Further reading**: [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) for core OpenTelemetry concepts (traces, spans, context propagation, sampling). [07-observability-backends.md](./07-observability-backends.md) for production backend selection, collector placement, and sampling strategies.
---
## 1. OpenTelemetry Spans
### 1.1 Complete Span Inventory (35 spans)
> **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.
#### RPC Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| -------------------- | ------------------ | ----------------- | ------------------------------------------------------------------------ |
| `rpc.http_request` | — | ServerHandler.cpp | Top-level HTTP RPC request entry point |
| `rpc.process` | `rpc.http_request` | ServerHandler.cpp | RPC processing pipeline |
| `rpc.ws_message` | — | ServerHandler.cpp | WebSocket message handling |
| `rpc.ws_upgrade` | — | ServerHandler.cpp | WebSocket upgrade handshake (error path) |
| `rpc.command.<name>` | `rpc.process` | RPCHandler.cpp | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"rpc.http_request|rpc.command.*"}`
**Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)
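The "Where to find" queries in this section all follow one pattern, so they can be generated. A minimal sketch, assuming the service name `xrpld` used throughout this document; `traceql_for_spans` is an illustrative helper, not part of any tooling described here.

```python
def traceql_for_spans(patterns, service="xrpld"):
    """Build a TraceQL selector matching any of the given span-name patterns."""
    joined = "|".join(patterns)
    return f'{{resource.service.name="{service}" && name=~"{joined}"}}'


print(traceql_for_spans(["rpc.http_request", "rpc.command.*"]))
```

The patterns are passed through as regular expressions, so an unescaped `.` matches any character, as it does in the queries shown in this section.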
#### Transaction Spans
Controlled by `trace_transactions=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------ | -------------- | --------------- | ----------------------------------------------------------------- |
| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"tx.process|tx.receive"}`
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
#### PathFind Spans
Controlled by `trace_rpc=1` in `[telemetry]` config (pathfinding spans fire within RPC request handling).
| Span Name | Parent | Source File | Description |
| --------------------- | ------------------ | ---------------- | -------------------------------------------------------- |
| `pathfind.request` | `rpc.command.*` | PathRequests.cpp | RPC entry for path_find / ripple_path_find |
| `pathfind.compute` | `pathfind.request` | PathRequest.cpp | Single path computation (doUpdate) |
| `pathfind.update_all` | — | PathRequests.cpp | Async recomputation of all active path requests on close |
| `pathfind.discover` | `pathfind.compute` | Pathfinder.cpp | Graph exploration phase (Pathfinder::find) |
| `pathfind.rank` | `pathfind.compute` | Pathfinder.cpp | Path ranking and selection phase |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"pathfind.*"}`
**Grafana dashboard**: _RPC & Pathfinding (StatsD)_ (`xrpld-statsd-rpc`) for StatsD timers; span-derived metrics via _RPC Performance_ (`xrpld-rpc-perf`)
#### TxQ Spans
Controlled by `trace_transactions=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------------ | ------------- | ----------- | ---------------------------------------------------- |
| `txq.enqueue` | `tx.process` | TxQ.cpp | Queue admission decision (apply/queue/reject) |
| `txq.apply_direct` | `txq.enqueue` | TxQ.cpp | Direct application attempt (bypassing queue) |
| `txq.batch_clear` | `txq.enqueue` | TxQ.cpp | Batch clear of account's queued transactions |
| `txq.accept` | — | TxQ.cpp | Ledger-close accept loop (drain queued transactions) |
| `txq.accept.tx` | `txq.accept` | TxQ.cpp | Per-transaction apply within accept loop |
| `txq.cleanup` | — | TxQ.cpp | Post-close cleanup (expire old transactions) |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"txq.*"}`
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
#### gRPC Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| -------------- | ------ | -------------- | ----------------------------------------------------------------------------- |
| `grpc.request` | — | GRPCServer.cpp | Single gRPC request (GetLedger, GetLedgerData, GetLedgerDiff, GetLedgerEntry) |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name="grpc.request"}`
#### Consensus Spans
Controlled by `trace_consensus=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ---------------------------- | ----------------- | ---------------- | ----------------------------------------------------- |
| `consensus.round` | — | RCLConsensus.cpp | Top-level round span (deterministic trace ID) |
| `consensus.proposal.send` | `consensus.round` | RCLConsensus.cpp | Node broadcasts its transaction set proposal |
| `consensus.ledger_close` | `consensus.round` | RCLConsensus.cpp | Ledger close event triggered by consensus |
| `consensus.establish` | `consensus.round` | Consensus.h | Establish phase — convergence loop |
| `consensus.update_positions` | `consensus.round` | Consensus.h | Update positions during establish phase |
| `consensus.check` | `consensus.round` | Consensus.h | Check for consensus agreement |
| `consensus.accept` | `consensus.round` | RCLConsensus.cpp | Consensus accepts a ledger (round complete) |
| `consensus.accept.apply` | `consensus.round` | RCLConsensus.cpp | Ledger application with close time details |
| `consensus.validation.send` | `consensus.round` | RCLConsensus.cpp | Validation message sent after ledger accepted |
| `consensus.mode_change` | `consensus.round` | RCLConsensus.cpp | Consensus mode transition (e.g., tracking->proposing) |
> **Note**: `toDisplayString(ConsensusMode)` (in `ConsensusTypes.h`) provides Title Case display names for mode attribute values: `"Proposing"`, `"Observing"`, `"Wrong Ledger"`, `"Switched Ledger"`. This is separate from `to_string()` which returns stable log-format strings.
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"consensus.*"}`
**Grafana dashboard**: _Consensus Health_ (`xrpld-consensus`)
#### Ledger Spans
Controlled by `trace_ledger=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ----------------- | ------ | ---------------- | ---------------------------------------------- |
| `ledger.build` | — | BuildLedger.cpp | Build new ledger from accepted transaction set |
| `ledger.validate` | — | LedgerMaster.cpp | Ledger promoted to validated status |
| `ledger.store` | — | LedgerMaster.cpp | Ledger stored to database/history |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"ledger.*"}`
**Grafana dashboard**: _Ledger Operations_ (`xrpld-ledger-ops`)
#### Peer Spans
Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (high volume).
| Span Name | Parent | Source File | Description |
| ------------------------- | ------ | ----------- | ------------------------------------- |
| `peer.proposal.receive` | — | PeerImp.cpp | Consensus proposal received from peer |
| `peer.validation.receive` | — | PeerImp.cpp | Validation message received from peer |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"peer.*"}`
**Grafana dashboard**: _Peer Network_ (`xrpld-peer-net`)
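The category flags named above (`trace_rpc`, `trace_transactions`, `trace_consensus`, `trace_ledger`, `trace_peer`) determine which spans are emitted. The sketch below maps a span name to its controlling flag, following the "Controlled by" notes in each subsection. It is illustrative only: the parser is not xrpld's config loader, and treating a missing flag as disabled is an assumption (the source states a default only for `trace_peer`).

```python
# Span-name prefix -> controlling flag, per the subsections above.
CATEGORY_FLAGS = {
    "rpc": "trace_rpc",
    "pathfind": "trace_rpc",
    "grpc": "trace_rpc",
    "tx": "trace_transactions",
    "txq": "trace_transactions",
    "consensus": "trace_consensus",
    "ledger": "trace_ledger",
    "peer": "trace_peer",
}


def parse_telemetry_stanza(text: str) -> dict:
    """Collect key=value pairs found under the [telemetry] section."""
    flags, in_section = {}, False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_section = line == "[telemetry]"
        elif in_section and "=" in line and not line.startswith("#"):
            key, _, value = line.partition("=")
            flags[key.strip()] = value.strip()
    return flags


def span_enabled(span_name: str, flags: dict) -> bool:
    """Assumption: a flag that is absent counts as disabled."""
    flag = CATEGORY_FLAGS.get(span_name.split(".", 1)[0])
    return flag is not None and flags.get(flag) == "1"


flags = parse_telemetry_stanza("[telemetry]\ntrace_rpc=1\ntrace_peer=0\n")
print(span_enabled("pathfind.compute", flags), span_enabled("peer.proposal.receive", flags))
```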
---
### 1.2 Complete Attribute Inventory (81 attributes)
> **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.
Every span can carry key-value attributes that provide context for filtering and aggregation.
#### RPC Attributes
| Attribute | Type | Set On | Description |
| ---------------------- | ------ | --------------- | ------------------------------------------------ |
| `command` | string | `rpc.command.*` | RPC command name (e.g., `server_info`, `ledger`) |
| `version` | int64 | `rpc.command.*` | API version number |
| `rpc_role` | string | `rpc.command.*` | Caller role: `"admin"` or `"user"` |
| `rpc_status` | string | `rpc.command.*` | Result: `"success"` or `"error"` |
| `request_payload_size` | int64 | `rpc.command.*` | Request payload size in bytes |
**Tempo query**: `{span.command="server_info"}` to find all `server_info` calls.
**Prometheus label**: `xrpl_rpc_command` (dots converted to underscores by SpanMetrics).
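The dot-to-underscore conversion can be sketched as follows. This assumes the usual Prometheus label-name rule (`[a-zA-Z_][a-zA-Z0-9_]*`); the exact sanitization performed by the collector's exporter may differ, and `to_prometheus_label` is a hypothetical helper.

```python
import re


def to_prometheus_label(attribute: str) -> str:
    """Replace characters that are invalid in a Prometheus label name with '_'."""
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attribute)
    # Label names may not start with a digit.
    return label if re.match(r"[a-zA-Z_]", label) else "_" + label


print(to_prometheus_label("xrpl.tx.hash"))  # xrpl_tx_hash
```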
#### Transaction Attributes
| Attribute | Type | Set On | Description |
| ------------------- | ------- | -------------------------- | ---------------------------------------------------- |
| `xrpl.tx.hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) |
| `local` | boolean | `tx.process` | `true` if locally submitted, `false` if peer-relayed |
| `path` | string | `tx.process` | Submission path: `"sync"` or `"async"` |
| `suppressed` | boolean | `tx.receive` | `true` if transaction was suppressed (duplicate) |
| `tx_status` | string | `tx.receive` | Transaction status (e.g., `"known_bad"`) |
| `xrpl.peer.id` | int64 | `tx.receive` | Peer identifier (also set on peer spans) |
| `xrpl.peer.version` | string | `tx.receive` | Peer protocol version string |
**Tempo query**: `{span.xrpl.tx.hash="<hash>"}` to trace a specific transaction across nodes.
**Prometheus label**: `xrpl_tx_local` (used as SpanMetrics dimension).
#### PathFind Attributes
| Attribute | Type | Set On | Description |
| ---------------------------- | ------- | --------------------- | ----------------------------------------------- |
| `source_account` | string | `pathfind.request` | Source account address |
| `dest_account` | string | `pathfind.request` | Destination account address |
| `fast` | boolean | `pathfind.compute` | Whether this is a fast (non-full) pathfind |
| `search_level` | int64 | `pathfind.compute` | Search depth level |
| `num_complete_paths` | int64 | `pathfind.compute` | Number of complete paths found |
| `num_paths` | int64 | `pathfind.compute` | Total number of paths explored |
| `num_requests` | int64 | `pathfind.update_all` | Number of active path requests being recomputed |
| `xrpl.pathfind.ledger_index` | int64 | `pathfind.update_all` | Ledger index used for recomputation |
**Tempo query**: `{span.source_account="rHb9..."}` to find pathfind requests from a specific account.
#### TxQ Attributes
| Attribute | Type | Set On | Description |
| -------------------- | ------- | ------------------------------ | ---------------------------------------------------------- |
| `xrpl.tx.hash` | string | `txq.enqueue`, `txq.accept.tx` | Transaction hash in the queue |
| `txq_status` | string | `txq.enqueue` | Queue result: `"queued"`, `"applied_direct"`, `"rejected"` |
| `fee_level_paid` | int64 | `txq.enqueue` | Fee level paid by the transaction |
| `required_fee_level` | int64 | `txq.enqueue` | Minimum fee level required for queue admission |
| `queue_size` | int64 | `txq.accept` | Queue depth at start of accept |
| `ledger_changed` | boolean | `txq.accept` | Whether the open ledger changed since last accept |
| `xrpl.ledger.seq` | int64 | `txq.cleanup` | Ledger sequence for cleanup |
| `expired_count` | int64 | `txq.cleanup` | Number of expired transactions removed |
| `ter_code` | string | `txq.accept.tx` | Transaction engine result code |
| `retries_remaining` | int64 | `txq.accept.tx` | Remaining retry attempts for this transaction |
| `num_cleared` | int64 | `txq.batch_clear` | Number of transactions cleared in batch |
**Tempo query**: `{span.txq_status="rejected"}` to find rejected queue attempts.
#### gRPC Attributes
| Attribute | Type | Set On | Description |
| ------------ | ------ | -------------- | ------------------------------------------------------------ |
| `method` | string | `grpc.request` | gRPC method name (e.g., `GetLedger`, `GetLedgerData`) |
| `rpc_role` | string | `grpc.request` | Caller role: `"admin"` or `"user"` |
| `rpc_status` | string | `grpc.request` | Result: `"success"`, `"error"`, `"resource_exhausted"`, etc. |
**Tempo query**: `{span.method="GetLedger"}` to find gRPC ledger requests.
#### Consensus Attributes
| Attribute | Type | Set On | Description |
| --------------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| `xrpl.consensus.ledger_id` | string | `consensus.round` | Previous ledger hash (used for deterministic trace ID) |
| `xrpl.ledger.seq` | int64 | `consensus.round`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply` | Ledger sequence number |
| `xrpl.consensus.mode` | string | `consensus.round`, `consensus.proposal.send`, `consensus.ledger_close` | Node mode via `toDisplayString()`: `"Proposing"`, `"Observing"`, etc. |
| `xrpl.consensus.round` | int64 | `consensus.proposal.send` | Consensus round number |
| `proposers` | int64 | `consensus.proposal.send`, `consensus.accept` | Number of proposers in the round |
| `round_time_ms` | int64 | `consensus.accept`, `consensus.accept.apply` | Total consensus round duration in milliseconds |
| `proposing` | boolean | `consensus.validation.send` | Whether this node was a proposer |
| `consensus_state` | string | `consensus.accept.apply` | Consensus outcome: `"finished"` or `"moved_on"` |
| `close_time` | int64 | `consensus.accept.apply` | Agreed-upon ledger close time (epoch seconds) |
| `close_time_correct` | boolean | `consensus.accept.apply` | Whether validators reached agreement on close time |
| `close_resolution_ms` | int64 | `consensus.accept.apply` | Close time rounding granularity in milliseconds |
| `parent_close_time` | int64 | `consensus.accept.apply` | Parent ledger's close time (epoch seconds) |
| `close_time_self` | int64 | `consensus.accept.apply` | This node's proposed close time |
| `close_time_vote_bins` | string | `consensus.accept.apply` | Histogram of close time votes from validators |
| `resolution_direction` | string | `consensus.accept.apply` | Resolution change: `"increased"`, `"decreased"`, or `"unchanged"` |
| `converge_percent` | int64 | `consensus.establish` | Convergence percentage threshold |
| `establish_count` | int64 | `consensus.establish` | Number of establish iterations completed |
| `proposers_agreed` | int64 | `consensus.establish` | Number of proposers that agreed on this round |
| `avalanche_threshold` | int64 | `consensus.update_positions` | Avalanche threshold for dispute resolution |
| `close_time_threshold` | int64 | `consensus.update_positions` | Close time agreement threshold |
| `have_close_time_consensus` | boolean | `consensus.update_positions` | Whether close time consensus has been reached |
| `agree_count` | int64 | `consensus.check` | Number of proposers that agree with our position |
| `disagree_count` | int64 | `consensus.check` | Number of proposers that disagree with our position |
| `threshold_percent` | int64 | `consensus.check` | Required agreement threshold percentage |
| `consensus_result` | string | `consensus.check` | Check result: `"yes"`, `"no"`, or `"expired"` |
| `quorum` | int64 | `consensus.check` | Required quorum for validation |
| `validation_count` | int64 | `consensus.check` | Number of validations received |
| `trace_strategy` | string | `consensus.round` | Trace sampling strategy used for this round |
| `xrpl.consensus.round_id` | string | `consensus.round` | Deterministic round identifier |
| `xrpl.consensus.mode.old` | string | `consensus.mode_change` | Previous consensus mode |
| `xrpl.consensus.mode.new` | string | `consensus.mode_change` | New consensus mode |
| `xrpl.tx.id` | string | `consensus.update_positions` | Disputed transaction ID |
| `dispute_our_vote` | boolean | `consensus.update_positions` | Our vote on the disputed transaction |
| `dispute_yays` | int64 | `consensus.update_positions` | Number of proposers voting to include |
| `dispute_nays` | int64 | `consensus.update_positions` | Number of proposers voting to exclude |
**Tempo query**: `{span.xrpl.consensus.mode="proposing"}` finds rounds in which the node was proposing.
**Prometheus label**: `xrpl_consensus_mode` (used as SpanMetrics dimension).
#### Ledger Attributes
| Attribute | Type | Set On | Description |
| --------------------- | ------- | ------------------------------------------------------------- | ------------------------------------------------ |
| `xrpl.ledger.seq` | int64 | `ledger.build`, `ledger.validate`, `ledger.store`, `tx.apply` | Ledger sequence number |
| `close_time` | int64 | `ledger.build` | Ledger close time (epoch seconds) |
| `close_time_correct` | boolean | `ledger.build` | Whether close time was agreed upon by validators |
| `close_resolution_ms` | int64 | `ledger.build` | Close time rounding granularity in milliseconds |
| `tx_count` | int64 | `ledger.build`, `tx.apply` | Transactions in the ledger |
| `tx_failed` | int64 | `ledger.build`, `tx.apply` | Failed transactions in the ledger |
| `validations` | int64 | `ledger.validate` | Number of validations received for this ledger |
**Tempo query**: `{span.xrpl.ledger.seq=12345}` to find all spans for a specific ledger.
#### Peer Attributes
| Attribute | Type | Set On | Description |
| -------------------- | ------- | ---------------------------------------------------------------- | ---------------------------------------------------- |
| `xrpl.peer.id` | int64 | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier |
| `proposal_trusted` | boolean | `peer.proposal.receive` | Whether the proposal came from a trusted validator |
| `xrpl.ledger.hash` | string | `peer.validation.receive` | Ledger hash the validation refers to |
| `validation_full` | boolean | `peer.validation.receive` | Whether this is a full (not partial) validation |
| `validation_trusted` | boolean | `peer.validation.receive` | Whether the validation came from a trusted validator |
**Prometheus labels**: `xrpl_peer_proposal_trusted`, `xrpl_peer_validation_trusted` (SpanMetrics dimensions).
---
### 1.3 SpanMetrics — Derived Prometheus Metrics
> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights.
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in xrpld is needed.
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ------------------------------------------------------------------------------ |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
**Standard labels on every metric**: `span_name`, `status_code`, `service_name`, `span_kind`
**Additional dimension labels** (configured in `otel-collector-config.yaml`):
| Span Attribute | Prometheus Label | Applies To |
| --------------------- | ------------------------------ | ------------------------- |
| `command` | `xrpl_rpc_command` | `rpc.command.*` |
| `rpc_status` | `xrpl_rpc_status` | `rpc.command.*` |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` |
| `local` | `xrpl_tx_local` | `tx.process` |
| `proposal_trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` |
| `validation_trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` |
**Where to query**: Prometheus → `traces_span_metrics_calls_total{span_name="rpc.command.server_info"}`
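As a rough illustration of how a percentile is derived from the `_bucket` series, the sketch below mirrors the linear interpolation that PromQL's `histogram_quantile` performs, using the bucket bounds listed in the table above. The function name and the sample counts are invented for the example; this is not code from xrpld or the collector.

```python
# Upper bounds (ms) of the SpanMetrics buckets from the table above, plus +Inf.
BOUNDS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, float("inf")]


def estimate_quantile(q, cumulative_counts, bounds=BOUNDS_MS):
    """Estimate the q-quantile (0..1) from cumulative bucket counts.

    Interpolates linearly inside the bucket that contains the target rank,
    treating the lowest bucket as starting at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(bounds):
        raise ValueError("need one cumulative count per bucket bound")
    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")  # no observations, no meaningful quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in zip(bounds, cumulative_counts):
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound  # empty bucket (only reachable when rank is 0)
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

Because the estimate is interpolated, its precision is limited by the bucket width: any latency between 1000 ms and 5000 ms lands in the same bucket.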
---
## 2. StatsD Metrics (beast::insight)
> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory.
These are system-level metrics emitted by xrpld's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans.
### Configuration
```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=xrpld
```
> **Note**: The `prefix` value is user-configurable — all metric names in the tables below assume `prefix=xrpld` (matching the integration test and Grafana dashboards). If you change the prefix, replace `xrpld_` with `{your_prefix}_` in all PromQL queries.
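If the prefix is changed, the replacement described in the note can be scripted. The helper below is a minimal sketch (the function name is hypothetical) that rewrites only metric-name prefixes and leaves label values and dashboard UIDs such as `xrpld-statsd-rpc` untouched.

```python
import re


def rewrite_prefix(promql, new_prefix, old_prefix="xrpld"):
    """Replace `<old_prefix>_` at the start of metric names in a PromQL string."""
    # \b ensures we match the start of an identifier, so "my_xrpld_x" is not
    # rewritten; the trailing underscore skips label values like "xrpld".
    pattern = re.compile(rf"\b{re.escape(old_prefix)}_")
    return pattern.sub(lambda _match: f"{new_prefix}_", promql)
```

For example, `rewrite_prefix("rate(xrpld_total_Bytes_In[5m])", "mainnet")` yields `rate(mainnet_total_Bytes_In[5m])`. Span-derived metrics (`traces_span_metrics_*`) carry no prefix and are not affected.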
### 2.1 Gauges
| Prometheus Metric | Source File | Description | Typical Range |
| ------------------------------------------------- | --------------------- | ---------------------------------------- | ------------------------------- |
| `xrpld_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) |
| `xrpld_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) |
| `xrpld_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| `xrpld_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| `xrpld_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| `xrpld_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| `xrpld_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| `xrpld_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| `xrpld_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low |
| `xrpld_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| `xrpld_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| `xrpld_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| `xrpld_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 0–85 |
| `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `xrpld_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |
| `xrpld_Node_family_full_below_cache_size` | TaggedCache.h | FullBelowCache entry count | Varies |
| `xrpld_Node_family_full_below_cache_hit_rate` | TaggedCache.h | FullBelowCache hit rate percentage | 0–100 |
**Grafana dashboard**: _Node Health (StatsD)_ (`xrpld-statsd-node-health`)
### 2.2 Counters
| Prometheus Metric | Source File | Description |
| ------------------------------- | ------------------ | --------------------------------------------- |
| `xrpld_rpc_requests` | ServerHandler.cpp | Total RPC requests received |
| `xrpld_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts |
| `xrpld_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected |
| `xrpld_warn` | Logic.h | Resource manager warnings issued |
| `xrpld_drop` | Logic.h | Resource manager drops (connections rejected) |
**Note**: `xrpld_warn` and `xrpld_drop` use the non-standard StatsD meter type (`|m`). The OTel StatsD receiver only recognizes `|c`, `|g`, `|ms`, `|h`, and `|s`, so these two metrics may be silently dropped. See §7 Known Issues.
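To spot such lines in a packet capture, a small check like the following can help. It is a sketch that assumes the conventional StatsD wire format `name:value|type[|@rate]`; the metric names in the usage note are illustrative, not captured from xrpld.

```python
# Type suffixes the note above lists as recognized by the OTel StatsD receiver.
RECOGNIZED_TYPES = {"c", "g", "ms", "h", "s"}


def statsd_type(line):
    """Return the type suffix of one StatsD line, e.g. 'c' for 'a.b:1|c'."""
    name_value, separator, rest = line.strip().partition("|")
    if ":" not in name_value or not separator or not rest:
        raise ValueError(f"not a StatsD line: {line!r}")
    # Anything after a second '|' is the optional sample rate (e.g. '@0.1').
    return rest.split("|", 1)[0]


def is_recognized(line):
    """True if the receiver would accept this line's metric type."""
    return statsd_type(line) in RECOGNIZED_TYPES
```

With this, `is_recognized("xrpld.rpc.requests:1|c")` is true while `is_recognized("xrpld.warn:1|m")` is false, which is the situation the note describes.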
**Grafana dashboard**: _RPC & Pathfinding (StatsD)_ (`xrpld-statsd-rpc`)
### 2.3 Histograms (from StatsD timers)
| Prometheus Metric | Source File | Unit | Description |
| --------------------- | ----------------- | ----- | ------------------------------ |
| `xrpld_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution |
| `xrpld_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution |
| `xrpld_ios_latency` | Application.cpp | ms | I/O service loop latency |
| `xrpld_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration |
| `xrpld_pathfind_full` | PathRequests.h | ms | Full pathfinding duration |
Quantiles collected: 0th, 50th, 90th, 95th, 99th, 100th percentile.
**Grafana dashboards**: _Node Health_ (`ios_latency`), _RPC & Pathfinding_ (`rpc_time`, `rpc_size`, `pathfind_*`)
### 2.4 Overlay Traffic Metrics
For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), four gauges are emitted:
- `xrpld_{category}_Bytes_In`
- `xrpld_{category}_Bytes_Out`
- `xrpld_{category}_Messages_In`
- `xrpld_{category}_Messages_Out`
**Key categories**:
| Category | Description |
| ----------------------------------------------------------------- | -------------------------- |
| `total` | All traffic aggregated |
| `overhead` / `overhead_overlay` | Protocol overhead |
| `transactions` / `transactions_duplicate` | Transaction relay |
| `proposals` / `proposals_untrusted` / `proposals_duplicate` | Consensus proposals |
| `validations` / `validations_untrusted` / `validations_duplicate` | Consensus validations |
| `ledger_data_get` / `ledger_data_share` | Ledger data exchange |
| `ledger_data_Transaction_Node_get/share` | Transaction node data |
| `ledger_data_Account_State_Node_get/share` | Account state node data |
| `ledger_data_Transaction_Set_candidate_get/share` | Transaction set candidates |
| `getObject` / `haveTxSet` / `ledgerData` | Object requests |
| `ping` / `status` | Keepalive and status |
| `set_get` | Set requests |
**Grafana dashboards**: _Network Traffic_ (`xrpld-statsd-network`), _Overlay Traffic Detail_ (`xrpld-statsd-overlay-detail`), _Ledger Data & Sync_ (`xrpld-statsd-ledger-sync`)
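The naming scheme above is mechanical, so the full set of series for a category can be generated when building alert rules or checking dashboard coverage. A minimal sketch (helper name hypothetical):

```python
# The four gauges emitted per overlay traffic category, as listed above.
SUFFIXES = ("Bytes_In", "Bytes_Out", "Messages_In", "Messages_Out")


def overlay_metric_names(category, prefix="xrpld"):
    """Expand one traffic category into its four Prometheus gauge names."""
    return [f"{prefix}_{category}_{suffix}" for suffix in SUFFIXES]
```

For instance, `overlay_metric_names("total")` starts with `xrpld_total_Bytes_In`, the series used in the network-rate query in §5.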
### 2.5 Per-Job Timer Events
For each of the 36 non-special job types (defined in `JobTypes.h`), two StatsD timer events are emitted:
- `xrpld_{jobName}` — execution duration
- `xrpld_{jobName}_q` — dequeue wait time
These produce summary metrics with quantiles (0th, 50th, 90th, 95th, 99th, 100th).
**Key job types** (most operationally relevant):
| Job Name | Source Enum | Description |
| ------------------- | ---------------- | ----------------------------- |
| `acceptLedger` | `jtACCEPT` | Consensus round acceptance |
| `advanceLedger` | `jtADVANCE` | Ledger advancement |
| `transaction` | `jtTRANSACTION` | Transaction processing |
| `writeObjects` | `jtWRITE` | Database object writes |
| `publishNewLedger` | `jtPUBLEDGER` | New ledger publication |
| `trustedValidation` | `jtVALIDATION_t` | Trusted validation processing |
| `trustedProposal` | `jtPROPOSAL_t` | Trusted proposal processing |
| `clientRPC` | `jtCLIENT_RPC` | Client RPC request handling |
| `heartbeat` | `jtNETOP_TIMER` | Network heartbeat timer |
| `sweep` | `jtSWEEP` | Cache sweep / cleanup |
| `ledgerData` | `jtLEDGER_DATA` | Ledger data processing |
Special job types (`limit=0`: `peerCommand`, `diskAccess`, `processTransaction`, `orderBookSetup`, `pathFind`, `nodeRead`, `nodeWrite`, `generic`, `SyncReadNode`, `AsyncReadNode`, `WriteNode`) do **not** emit timer events.
**Grafana dashboard**: _Node Health (StatsD)_ (`xrpld-statsd-node-health`) — Key Jobs and All Jobs panels
---
## 3. Grafana Dashboard Reference
> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8 for Grafana data source provisioning (Tempo, Prometheus) and TraceQL query examples.
### 3.1 Span-Derived Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| -------------------- | -------------------- | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| RPC Performance | `xrpld-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | `xrpld-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | `xrpld-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap, close time correctness, resolution direction, close time drift, resolution change timeline, close time vote distribution |
| Ledger Operations | `xrpld-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `xrpld-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |
### 3.2 StatsD Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ----------------------------- | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| Node Health | `xrpld-statsd-node-health` | Prometheus (StatsD) | Ledger age, operating mode, I/O latency, job queue, fetch rate, key/all jobs execution time, cache size/hit rate, publish gap, state duration rate |
| Network Traffic | `xrpld-statsd-network` | Prometheus (StatsD) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category, duplicate traffic, all traffic categories detail |
| RPC & Pathfinding | `xrpld-statsd-rpc` | Prometheus (StatsD) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `xrpld-statsd-overlay-detail` | Prometheus (StatsD) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `xrpld-statsd-ledger-sync` | Prometheus (StatsD) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
### 3.3 Consensus Close-Time Panels
The Consensus Health dashboard includes 5 close-time panels added in Phase 4:
| Panel | Metric / Attribute | Description |
| ---------------------------- | --------------------------------- | ------------------------------------------------------------------------ |
| Close Time Correctness | `close_time_correct` | Percentage of rounds with agreed-upon close time |
| Resolution Direction | `resolution_direction` | Rate of resolution increases, decreases, and unchanged per time interval |
| Close Time Drift | `close_time` vs `close_time_self` | Difference between agreed close time and node's own proposed close time |
| Resolution Change Timeline | `close_resolution_ms` | Close time resolution granularity over time |
| Close Time Vote Distribution | `close_time_vote_bins` | Histogram of validator close time votes per round |
**Template variables** (Consensus Health dashboard):
| Variable | Source Attribute | Description |
| ----------------------- | ------------------------------------- | ------------------------------------------------------------------------ |
| `$node` | `exported_instance` | Filter by xrpld node instance |
| `$close_time_correct` | `xrpl_consensus_close_time_correct` | Filter by close time correctness (`true` / `false`) |
| `$resolution_direction` | `xrpl_consensus_resolution_direction` | Filter by resolution direction (`increased` / `decreased` / `unchanged`) |
### 3.4 Accessing the Dashboards
1. Open Grafana at **http://localhost:3000**
2. Navigate to **Dashboards → xrpld** folder
3. All 10 dashboards are auto-provisioned from `docker/telemetry/grafana/dashboards/`
---
## 4. Tempo Trace Search Guide
> **See also**: [08-appendix.md](./08-appendix.md) §8.2 for span hierarchy visualizations. [05-configuration-reference.md](./05-configuration-reference.md) §5.8.5 for TraceQL query examples.
### Finding Traces by Type
| What to Find | Tempo TraceQL Query |
| ------------------------ | ------------------------------------------------------------------------------ |
| All RPC calls | `{resource.service.name="xrpld" && name="rpc.http_request"}` |
| Specific RPC command | `{resource.service.name="xrpld" && name="rpc.command.server_info"}` |
| Slow RPC calls | `{resource.service.name="xrpld" && name=~"rpc.command.*"} \| duration > 100ms` |
| Failed RPC calls | `{span.rpc_status="error"}` |
| Specific transaction | `{span.xrpl.tx.hash="<hex_hash>"}` |
| Local transactions only | `{span.local=true}` |
| Consensus rounds | `{resource.service.name="xrpld" && name="consensus.accept"}` |
| Rounds by mode | `{span.xrpl.consensus.mode="proposing"}` |
| Specific ledger | `{span.xrpl.ledger.seq=12345}` |
| Peer proposals (trusted) | `{span.proposal_trusted=true}` |
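The queries in the table can also be run outside Grafana through Tempo's HTTP search endpoint (`/api/search` with a `q` parameter). The sketch below only builds the request URL; the base URL `http://localhost:3200` is an assumption about the local stack, not something this document specifies.

```python
from urllib.parse import urlencode


def tempo_search_url(traceql, base_url="http://localhost:3200", limit=20):
    """Build a Tempo search URL for a TraceQL query (URL-encodes the query)."""
    return f"{base_url}/api/search?" + urlencode({"q": traceql, "limit": limit})
```

The result can be passed to `curl` or `urllib.request.urlopen`; braces, quotes, and `=` in the TraceQL expression are percent-encoded so the query survives shell and HTTP transport.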
### Trace Structure
A typical RPC trace shows the span hierarchy:
```
rpc.http_request (ServerHandler)
└── rpc.process (ServerHandler)
└── rpc.command.server_info (RPCHandler)
```
A consensus round groups child spans under a deterministic trace ID:
```
consensus.round (top-level, deterministic trace ID from ledger hash)
├── consensus.ledger_close (close event)
├── consensus.proposal.send (broadcast proposal)
├── consensus.establish (convergence loop)
│ ├── consensus.update_positions (update disputes)
│ └── consensus.check (check agreement)
├── consensus.accept (accept result)
├── consensus.accept.apply (apply with close time details)
├── consensus.validation.send (send validation)
└── consensus.mode_change (mode transition, if any)
ledger.build (build new ledger)
└── tx.apply (apply transaction set)
ledger.validate (promote to validated)
ledger.store (persist to DB)
```
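This document does not spell out how the deterministic trace ID is derived from the ledger hash. One plausible scheme, shown here purely as an assumption, is to truncate the 256-bit hash to the 128 bits a W3C trace ID requires, so that every node processing the same ledger computes the same ID without any coordination.

```python
def trace_id_from_hash(hash_hex):
    """Derive a 128-bit trace ID (32 hex chars) from a 256-bit hash.

    ASSUMPTION: simple truncation to the first 16 bytes; the real xrpld
    derivation may differ.
    """
    raw = bytes.fromhex(hash_hex)
    if len(raw) != 32:
        raise ValueError("expected a 256-bit (64 hex character) hash")
    trace_id = raw[:16]
    if trace_id == bytes(16):
        # An all-zero trace ID is invalid in W3C Trace Context.
        raise ValueError("derived trace ID is all zeros")
    return trace_id.hex()
```

Whatever the exact derivation, the property that matters is that it is a pure function of the hash, which is what lets spans from different nodes join the same trace.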
---
## 5. Prometheus Query Examples
> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus StatsD metrics with trace-derived metrics.
### Span-Derived Metrics
```promql
# RPC request rate by command (last 5 minutes)
sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))
# RPC p95 latency by command
histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))
# Consensus round duration p95
histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m])))
# Transaction processing rate (local vs relay)
sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))
# Trusted vs untrusted proposal rate
sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
```
### StatsD Metrics
```promql
# Validated ledger age (should be < 10s)
xrpld_LedgerMaster_Validated_Ledger_Age
# Active peer count
xrpld_Peer_Finder_Active_Inbound_Peers + xrpld_Peer_Finder_Active_Outbound_Peers
# RPC response time p95
histogram_quantile(0.95, xrpld_rpc_time_bucket)
# Total network bytes in (rate)
rate(xrpld_total_Bytes_In[5m])
# Cumulative time in Full state (should grow steadily once the node is synced)
xrpld_State_Accounting_Full_duration
```
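The same checks can be automated against Prometheus's instant-query API (`/api/v1/query`). The sketch below takes an already-decoded JSON response for `xrpld_LedgerMaster_Validated_Ledger_Age` and flags nodes above the 30-second "stale" threshold from §2.1. Grouping by `exported_instance` follows the `$node` template variable in §3.3; the function name is hypothetical.

```python
def stale_instances(response, threshold_seconds=30.0):
    """Return {instance: age} for samples whose ledger age exceeds the threshold."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    stale = {}
    for sample in response["data"]["result"]:
        # Instant vectors carry [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        age = float(raw_value)
        if age > threshold_seconds:
            instance = sample["metric"].get("exported_instance", "unknown")
            stale[instance] = age
    return stale
```

An empty result means every reporting node validated a ledger within the threshold; it does not detect nodes that have stopped reporting altogether.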
---
## 6. SpanNames Header File Inventory
All span names and attributes are defined as compile-time constants in colocated `SpanNames.h` headers. Each header lives next to its subsystem's implementation.
| Header File | Subsystem | Span Count | Attribute Count | Notes |
| ----------------------------------------------- | ------------- | ---------- | --------------- | ------------------------------------------- |
| `src/xrpld/rpc/detail/RpcSpanNames.h` | RPC (HTTP/WS) | 5 | 5 | Includes `rpc.ws_upgrade` error path |
| `src/xrpld/rpc/detail/PathFindSpanNames.h` | PathFind | 5 | 8 | Covers one-shot and subscription paths |
| `src/xrpld/app/main/GrpcSpanNames.h` | gRPC | 1 | 3 | Flat single-span structure per request |
| `src/xrpld/app/misc/TxSpanNames.h` | Transaction | 2 | 7 | Includes peer context attributes |
| `src/xrpld/app/misc/detail/TxQSpanNames.h` | TxQ | 6 | 11 | Queue lifecycle: enqueue through cleanup |
| `src/xrpld/app/consensus/ConsensusSpanNames.h` | Consensus | 10 | 35 | Deterministic trace IDs, close-time details |
| `src/xrpld/app/ledger/detail/LedgerSpanNames.h` | Ledger | 4 | 7 | Build, store, validate, tx.apply |
| `src/xrpld/overlay/detail/PeerSpanNames.h` | Peer Overlay | 2 | 5 | Proposal and validation receive |
> **Design convention**: SpanNames headers are colocated with their subsystem classes rather than centralized in `telemetry/`. See [memory/feedback_span-names-colocation.md](../.claude/memory/feedback_span-names-colocation.md) for rationale.
---
## 7. Known Issues
| Issue | Impact | Status |
| ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- |
| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1: needs `\|m` → `\|c` change in StatsDCollector.cpp |
| `xrpld_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `xrpld_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg |
| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet |
---
## 8. Privacy and Data Collection
The telemetry system is designed with privacy in mind:
- **No private keys** are ever included in spans or metrics
- **No account balances** or financial data is traced
- **Transaction hashes** are included (public on-ledger data) but not transaction contents
- **Peer IDs** are internal identifiers, not IP addresses
- **All telemetry is opt-in** — disabled by default at build time (`-Dtelemetry=OFF`)
- **Sampling** reduces data volume — `sampling_ratio=0.01` recommended for production
- **Data stays local** — the default stack sends data to `localhost` only
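To make the effect of `sampling_ratio` concrete, the sketch below shows ratio-based head sampling in the style used by several OpenTelemetry SDKs: the low 64 bits of the trace ID are compared against a bound proportional to the ratio. Whether xrpld's sampler works exactly this way is an assumption; the names are illustrative.

```python
LOW_64_MASK = (1 << 64) - 1


def sampling_bound(ratio):
    """Upper bound (exclusive) on the low 64 bits of sampled trace IDs."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return round(ratio * (1 << 64))


def should_sample(trace_id, ratio):
    """Decide from the trace ID alone, so all spans of a trace agree."""
    return (trace_id & LOW_64_MASK) < sampling_bound(ratio)
```

Because the decision depends only on the trace ID, a trace is either kept whole or dropped whole; with `ratio=0.01` roughly one trace in a hundred is kept when IDs are uniformly distributed.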
---
## 9. Configuration Quick Reference
> **Full reference**: [05-configuration-reference.md](./05-configuration-reference.md) §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production).
### Minimal Setup (development)
```ini
[telemetry]
enabled=1
[insight]
server=statsd
address=127.0.0.1:8125
prefix=xrpld
```
### Production Setup
```ini
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096
[insight]
server=statsd
address=otel-collector:8125
prefix=xrpld
```
### Trace Category Toggle
| Config Key | Default | Controls |
| -------------------- | ------- | ---------------------------- |
| `trace_rpc` | `1` | `rpc.*` spans |
| `trace_transactions` | `1` | `tx.*` spans |
| `trace_consensus` | `1` | `consensus.*` spans |
| `trace_ledger` | `1` | `ledger.*` spans |
| `trace_peer` | `0` | `peer.*` spans (high volume) |
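The toggle table amounts to a lookup from a span's name prefix to a config key. The sketch below encodes the defaults from the table and is meant only to illustrate the mapping; it is not the xrpld config parser, and span families outside the table are deliberately rejected rather than guessed.

```python
# Defaults and prefixes taken from the table above.
DEFAULT_TOGGLES = {
    "trace_rpc": True,
    "trace_transactions": True,
    "trace_consensus": True,
    "trace_ledger": True,
    "trace_peer": False,
}
PREFIX_TO_KEY = {
    "rpc": "trace_rpc",
    "tx": "trace_transactions",
    "consensus": "trace_consensus",
    "ledger": "trace_ledger",
    "peer": "trace_peer",
}


def span_enabled(span_name, overrides=None):
    """Return whether a span would be emitted under the given toggle overrides."""
    category = span_name.split(".", 1)[0]
    key = PREFIX_TO_KEY.get(category)
    if key is None:
        raise ValueError(f"no toggle listed for span family {category!r}")
    toggles = {**DEFAULT_TOGGLES, **(overrides or {})}
    return toggles[key]
```

With defaults, `span_enabled("peer.proposal.receive")` is false, matching the Known Issues entry about peer tracing being off unless `trace_peer=1`.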

@@ -56,6 +56,7 @@ flowchart TB
appendix["08-appendix.md"]
secure["secure-OTel.md"]
poc["POC_taskList.md"]
dataref["09-data-collection-reference.md"]
end
overview --> fundamentals
@@ -73,6 +74,7 @@ flowchart TB
backends --> appendix
backends --> secure
phases --> poc
appendix --> dataref
style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
style fundamentals fill:#00695c,stroke:#004d40,color:#fff
@@ -90,6 +92,7 @@ flowchart TB
style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
style secure fill:#4a148c,stroke:#2e0d57,color:#fff
style poc fill:#4a148c,stroke:#2e0d57,color:#fff
style dataref fill:#4a148c,stroke:#2e0d57,color:#fff
```
</div>
@@ -98,19 +101,20 @@ flowchart TB
## Table of Contents
| Section | Document | Description |
| ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- |
| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing concepts, span relationships, context propagation |
| **1** | [Architecture Analysis](./01-architecture-analysis.md) | xrpld component analysis, trace points, instrumentation priorities |
| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules |
| **5** | [Configuration Reference](./05-configuration-reference.md) | xrpld config, CMake integration, Collector configurations |
| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |
| **Sec** | [Securing the OTel Pipeline](./secure-OTel.md) | Threat model and hardening (mTLS, peer trace-context validation) |
| **POC** | [POC Task List](./POC_taskList.md) | Proof of concept tasks for RPC tracing end-to-end demo |
| Section | Document | Description |
| ------- | -------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing concepts, span relationships, context propagation |
| **1** | [Architecture Analysis](./01-architecture-analysis.md) | xrpld component analysis, trace points, instrumentation priorities |
| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules |
| **5** | [Configuration Reference](./05-configuration-reference.md) | xrpld config, CMake integration, Collector configurations |
| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |
| **9** | [Data Collection Reference](./09-data-collection-reference.md) | Complete inventory of spans, attributes, metrics, and dashboards |
| **Sec** | [Securing the OTel Pipeline](./secure-OTel.md) | Threat model and hardening (mTLS, peer trace-context validation) |
| **POC** | [POC Task List](./POC_taskList.md) | Proof of concept tasks for RPC tracing end-to-end demo |
---
@@ -224,6 +228,14 @@ The appendix contains a glossary of OpenTelemetry and xrpld-specific terms, refe
---
## 9. Data Collection Reference
A single-source-of-truth reference documenting every piece of telemetry data collected by xrpld. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 10 Grafana dashboards. Includes Tempo trace search guides and Prometheus query examples.
➡️ **[View Data Collection Reference](./09-data-collection-reference.md)**
---
## Securing the OTel Pipeline
Threat model and hardening guidance for production deployments where xrpld nodes ship telemetry to a centrally-hosted collector across an untrusted network. Covers the two attack surfaces (collector ingress and peer trace-context spoofing) and the chosen defenses: mTLS as primary collector auth, NetworkPolicy as defense-in-depth, and source-side validation plus per-peer rate limiting for the `protocol::TraceContext` field on peer messages.

@@ -204,3 +204,36 @@ Node health (`amendment_blocked`, `server_state`) is not part of the telemetry s
**Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority).
**Dropped**: Task 2.8 (node health not duplicated on traces).
**Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this).
---
## Known Issues / Future Work
### Thread safety of TelemetryImpl::stop() vs startSpan()
`TelemetryImpl::stop()` resets `sdkProvider_` (a `std::shared_ptr`) without
synchronization. `getTracer()` reads the same member from RPC handler threads.
This is a data race if any thread calls `startSpan()` concurrently with `stop()`.
**Current mitigation**: `Application::stop()` shuts down `serverHandler_`,
`overlay_`, and `jobQueue_` before calling `telemetry_->stop()`, so no callers
remain. See comments in `Telemetry.cpp:stop()` and `Application.cpp`.
**TODO**: Add an `std::atomic<bool> stopped_` flag checked in `getTracer()` to
make this robust against future shutdown order changes.
### Macro incompatibility: XRPL_TRACE_SPAN vs XRPL_TRACE_SET_ATTR
`XRPL_TRACE_SPAN` and `XRPL_TRACE_SPAN_KIND` declare `_xrpl_guard_` as a bare
`SpanGuard`, but `XRPL_TRACE_SET_ATTR` and `XRPL_TRACE_EXCEPTION` call
`_xrpl_guard_.has_value()` which requires `std::optional<SpanGuard>`. Using
`XRPL_TRACE_SPAN` followed by `XRPL_TRACE_SET_ATTR` in the same scope would
fail to compile.
**Current mitigation**: No call site currently uses `XRPL_TRACE_SPAN` — all
production code uses the conditional macros (`XRPL_TRACE_RPC`, `XRPL_TRACE_TX`,
etc.) which correctly wrap the guard in `std::optional`.
**TODO**: Either make `XRPL_TRACE_SPAN`/`XRPL_TRACE_SPAN_KIND` also wrap in
`std::optional`, or document that `XRPL_TRACE_SET_ATTR` is only compatible with
the conditional macros.

@@ -524,3 +524,14 @@ This gives the best of both worlds: guaranteed cross-node correlation via determ
- [ ] <5% overhead on transaction throughput
- [x] Deterministic trace_id: same trace_id for same tx across all nodes
- [x] Protobuf span_id propagation preserves parent-child ordering when available
---
## Known Issues / Future Work
### Unused trace_state proto field
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for
W3C `tracestate` vendor-specific key-value pairs but is not read or written by
`TraceContextPropagator`. Wire it when cross-vendor trace propagation is needed.
Leaving it unset adds no wire cost, since absent proto `optional` fields are not serialized.

@@ -0,0 +1,221 @@
# Phase 5: Integration Test Task List
> **Goal**: End-to-end verification of the complete telemetry pipeline using a
> 6-node consensus network. Proves that RPC, transaction, and consensus spans
> flow through the observability stack (otel-collector, Tempo, Prometheus,
> Grafana) under realistic conditions.
>
> **Scope**: Integration test script, manual testing plan, 6-node local network
> setup, Tempo/Prometheus/Grafana verification.
>
> **Branch**: `pratik/otel-phase5-docs-deployment`
### Related Plan Documents
| Document | Relevance |
| ---------------------------------------------------------------- | ------------------------------------------ |
| [07-observability-backends.md](./07-observability-backends.md) | Tempo, Grafana, Prometheus setup |
| [05-configuration-reference.md](./05-configuration-reference.md) | Collector config, Docker Compose |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 5 tasks, definition of done |
| [Phase5_taskList.md](./Phase5_taskList.md) | Phase 5 main task list (5.6 = integration) |
---
## Task IT.1: Create Integration Test Script
**Objective**: Automated bash script that stands up a 6-node xrpld network
with telemetry, exercises all span categories, and verifies data in
Tempo/Prometheus.
**What to do**:
- Create `docker/telemetry/integration-test.sh`:
  - Prerequisites check (docker, xrpld binary, curl, jq)
  - Start observability stack via `docker compose`
  - Generate 6 validator key pairs via temp standalone xrpld
  - Generate 6 node configs + shared `validators.txt`
  - Start 6 xrpld nodes in consensus mode (`--start`, no `-a`)
  - Wait for all nodes to reach `"proposing"` state (120s timeout)
**Key new file**: `docker/telemetry/integration-test.sh`
**Verification**:
- [ ] Script starts without errors
- [ ] All 6 nodes reach "proposing" state
- [ ] Observability stack is healthy (otel-collector, Tempo, Prometheus, Grafana)
---
## Task IT.2: RPC Span Verification (Phase 2)
**Objective**: Verify RPC spans flow through the telemetry pipeline.
**What to do**:
- Send `server_info`, `server_state`, `ledger` RPCs to node1 (port 5005)
- Wait for batch export (5s)
- Query Tempo API for:
  - `rpc.request` spans (ServerHandler::onRequest)
  - `rpc.process` spans (ServerHandler::processRequest)
  - `rpc.command.server_info` spans (callMethod)
  - `rpc.command.server_state` spans (callMethod)
  - `rpc.command.ledger` spans (callMethod)
- Verify `command` attribute present on `rpc.command.*` spans
**Verification**:
- [ ] Tempo shows `rpc.request` traces
- [ ] Tempo shows `rpc.process` traces
- [ ] Tempo shows `rpc.command.*` traces with correct attributes
---
## Task IT.3: Transaction Span Verification (Phase 3)
**Objective**: Verify transaction spans flow through the telemetry pipeline.
**What to do**:
- Get genesis account sequence via `account_info` RPC
- Submit Payment transaction using genesis seed (`snoPBrXtMeMyMHUVTgbuqAfg1SUTb`)
- Wait for consensus inclusion (10s)
- Query Tempo API for:
  - `tx.process` spans (NetworkOPsImp::processTransaction) on submitting node
  - `tx.receive` spans (PeerImp::handleTransaction) on peer nodes
- Verify `xrpl.tx.hash` attribute on `tx.process` spans
- Verify `xrpl.peer.id` attribute on `tx.receive` spans
**Verification**:
- [ ] Tempo shows `tx.process` traces with `xrpl.tx.hash`
- [ ] Tempo shows `tx.receive` traces with `xrpl.peer.id`
---
## Task IT.4: Consensus Span Verification (Phase 4)
**Objective**: Verify consensus spans flow through the telemetry pipeline.
**What to do**:
- Consensus runs automatically in 6-node network
- Query Tempo API for:
  - `consensus.proposal.send` (Adaptor::propose)
  - `consensus.ledger_close` (Adaptor::onClose)
  - `consensus.accept` (Adaptor::onAccept)
  - `consensus.validation.send` (Adaptor::validate)
- Verify attributes:
  - `xrpl.consensus.mode` on `consensus.ledger_close`
  - `xrpl.consensus.proposers` on `consensus.accept`
  - `xrpl.consensus.ledger.seq` on `consensus.validation.send`
**Verification**:
- [ ] Tempo shows `consensus.ledger_close` traces with `xrpl.consensus.mode`
- [ ] Tempo shows `consensus.accept` traces with `xrpl.consensus.proposers`
- [ ] Tempo shows `consensus.proposal.send` traces
- [ ] Tempo shows `consensus.validation.send` traces
---
## Task IT.5: Spanmetrics Verification (Phase 5)
**Objective**: Verify spanmetrics connector derives RED metrics from spans.
**What to do**:
- Query Prometheus for `traces_span_metrics_calls_total`
- Query Prometheus for `traces_span_metrics_duration_milliseconds_count`
- Verify Grafana loads at `http://localhost:3000`
**Verification**:
- [ ] Prometheus returns non-empty results for `traces_span_metrics_calls_total`
- [ ] Prometheus returns non-empty results for duration histogram
- [ ] Grafana UI accessible with dashboards visible
---
## Task IT.6: Manual Testing Plan
**Objective**: Document how to run tests manually for future reference.
**What to do**:
- Create `docker/telemetry/TESTING.md` with:
  - Prerequisites section
  - Single-node standalone test (quick verification)
  - 6-node consensus test (full verification)
  - Expected span catalog (all 12 span names with attributes)
  - Verification queries (Tempo API, Prometheus API)
  - Troubleshooting guide
**Key new file**: `docker/telemetry/TESTING.md`
**Verification**:
- [ ] Document covers both single-node and multi-node testing
- [ ] All 12 span names documented with source file and attributes
- [ ] Troubleshooting section covers common failure modes
---
## Task IT.7: Run and Verify
**Objective**: Execute the integration test and validate results.
**What to do**:
- Run `docker/telemetry/integration-test.sh` locally
- Debug any failures
- Leave stack running for manual verification
- Share URLs:
  - Tempo: `http://localhost:3200`
  - Grafana: `http://localhost:3000`
  - Prometheus: `http://localhost:9090`
**Verification**:
- [ ] Script completes with all checks passing
- [ ] Tempo UI shows xrpld service with all expected span names
- [ ] Grafana dashboards load and show data
---
## Task IT.8: Commit
**Objective**: Commit all new files to Phase 5 branch.
**What to do**:
- Run `pcc` (pre-commit checks)
- Commit 3 new files to `pratik/otel-phase5-docs-deployment`
**Verification**:
- [ ] `pcc` passes
- [ ] Commit created on Phase 5 branch
---
## Summary
| Task | Description | New Files | Depends On |
| ---- | ----------------------------- | --------- | ---------- |
| IT.1 | Integration test script | 1 | Phase 5 |
| IT.2 | RPC span verification | 0 | IT.1 |
| IT.3 | Transaction span verification | 0 | IT.1 |
| IT.4 | Consensus span verification | 0 | IT.1 |
| IT.5 | Spanmetrics verification | 0 | IT.1 |
| IT.6 | Manual testing plan | 1 | -- |
| IT.7 | Run and verify | 0 | IT.1-IT.6 |
| IT.8 | Commit | 0 | IT.7 |
**Exit Criteria**:
- [ ] All 6 xrpld nodes reach "proposing" state
- [ ] All 11 expected span names visible in Tempo
- [ ] Spanmetrics available in Prometheus
- [ ] Grafana dashboards show data
- [ ] Manual testing plan document complete


@@ -31,10 +31,10 @@
explicit:
buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
dimensions:
-  - name: xrpl.rpc.command
-  - name: xrpl.rpc.status
-  - name: xrpl.consensus.phase
-  - name: xrpl.tx.type
+  - name: command
+  - name: rpc_status
+  - name: consensus_phase
+  - name: tx_type
```
- Add `prometheus` exporter:
```yaml


@@ -108,6 +108,7 @@ words:
- enabled
- enablerepo
- endmacro
- EOCFG
- exceptioned
- EXPECT_STREQ
- exfiltration
@@ -200,6 +201,7 @@ words:
- nixfmt
- nixos
- nixpkgs
- NETOP
- NOLINT
- NOLINTNEXTLINE
- nonxrp
@@ -222,6 +224,8 @@ words:
- permdex
- perminute
- permissioned
- pgrep
- pkill
- pimpl
- pointee
- populator
@@ -242,6 +246,7 @@ words:
- Raphson
- reparent
- replayer
- reqps
- rerere
- retriable
- RIPD

2
docker/telemetry/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
# Runtime data generated by xrpld and telemetry stack
data/

522
docker/telemetry/TESTING.md Normal file

@@ -0,0 +1,522 @@
# OpenTelemetry Integration Testing Guide
This document describes how to verify the xrpld OpenTelemetry telemetry
pipeline end-to-end, from span generation through the observability stack
(otel-collector, Tempo, Prometheus, Grafana).
---
## Prerequisites
### Build xrpld with telemetry
```bash
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default --target xrpld
```
The binary is at `.build/xrpld`.
### Required tools
- **Docker** with `docker compose` (v2)
- **curl**
- **jq** (JSON processor)
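
A quick preflight can confirm these tools are on `PATH` before any test starts. The following is only a sketch: `missing_tools` is a hypothetical helper name, not something the repository is known to ship.

```bash
# Print the name of every required tool that is not found on PATH.
# (missing_tools is an illustrative helper, not part of integration-test.sh.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools docker curl jq)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing >&2
fi
```

The function only reports; deciding whether to abort is left to the caller.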
### Verify binary
```bash
.build/xrpld --version
```
---
## Test 1: Single-Node Standalone (Quick Verification)
This test verifies RPC and transaction spans in standalone mode. Consensus
spans will not fire because standalone mode does not run consensus.
### Step 1: Start the observability stack
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
Wait for services to be ready:
```bash
# otel-collector health
curl -sf http://localhost:13133/ && echo "collector ready"
# Tempo readiness
curl -sf http://localhost:3200/ready >/dev/null && echo "tempo ready"
```
### Step 2: Start xrpld in standalone mode
```bash
.build/xrpld --conf docker/telemetry/xrpld-telemetry.cfg -a --start
```
Wait a few seconds for the node to initialize.
### Step 3: Exercise RPC spans
```bash
# server_info
curl -s http://localhost:5005 \
-d '{"method":"server_info"}' | jq .result.info.server_state
# server_state
curl -s http://localhost:5005 \
-d '{"method":"server_state"}' | jq .result.state.server_state
# ledger
curl -s http://localhost:5005 \
-d '{"method":"ledger","params":[{"ledger_index":"current"}]}' |
jq .result.ledger_current_index
```
### Step 4: Submit a transaction
Close the ledger first (required in standalone mode):
```bash
curl -s http://localhost:5005 -d '{"method":"ledger_accept"}'
```
Submit a Payment from the genesis account:
```bash
curl -s http://localhost:5005 -d '{
"method": "submit",
"params": [{
"secret": "snoPBrXtMeMyMHUVTgbuqAfg1SUTb",
"tx_json": {
"TransactionType": "Payment",
"Account": "rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh",
"Destination": "rPMh7Pi9ct699iZUTWzJaUMR1o42VEfGqF",
"Amount": "10000000"
}
}]
}' | jq .result.engine_result
```
Expected result: `"tesSUCCESS"`.
Close the ledger again to finalize:
```bash
curl -s http://localhost:5005 -d '{"method":"ledger_accept"}'
```
### Step 5: Verify traces in Tempo
Wait 5 seconds for the batch export, then:
```bash
TEMPO="http://localhost:3200"
# Check xrpld service is registered
curl -s "$TEMPO/api/v2/search/tag/resource.service.name/values" | jq '.tagValues[].value'
# Check RPC spans
# -G sends the URL-encoded parameters as a GET query string
curl -s -G "$TEMPO/api/search" \
  --data-urlencode 'q={resource.service.name="xrpld" && name="rpc.http_request"}' \
  --data-urlencode 'limit=5' | jq '.traces | length'
curl -s -G "$TEMPO/api/search" \
  --data-urlencode 'q={resource.service.name="xrpld" && name="rpc.process"}' \
  --data-urlencode 'limit=5' | jq '.traces | length'
curl -s -G "$TEMPO/api/search" \
  --data-urlencode 'q={resource.service.name="xrpld" && name="rpc.command.server_info"}' \
  --data-urlencode 'limit=5' | jq '.traces | length'
# Check transaction spans
curl -s -G "$TEMPO/api/search" \
  --data-urlencode 'q={resource.service.name="xrpld" && name="tx.process"}' \
  --data-urlencode 'limit=5' | jq '.traces | length'
```
Alternatively, open Grafana Explore and select the Tempo datasource: http://localhost:3000
### Step 6: Teardown
```bash
# Stop xrpld (Ctrl+C in its terminal, or kill it by pattern)
kill $(pgrep -f 'xrpld.*xrpld-telemetry')
# Stop observability stack
docker compose -f docker/telemetry/docker-compose.yml down
# Clean xrpld data
rm -rf data/
```
### Expected spans (standalone mode)
| Span Name | Expected | Notes |
| --------------------------- | -------- | ----------------------------- |
| `rpc.http_request` | Yes | Every HTTP RPC call |
| `rpc.process` | Yes | Every RPC processing |
| `rpc.command.server_info` | Yes | server_info RPC |
| `rpc.command.server_state` | Yes | server_state RPC |
| `rpc.command.ledger` | Yes | ledger RPC |
| `rpc.command.submit` | Yes | submit RPC |
| `rpc.command.ledger_accept` | Yes | ledger_accept RPC |
| `tx.process` | Yes | Transaction submission |
| `tx.receive` | No | No peers in standalone |
| `consensus.*` | No | Consensus disabled standalone |
---
## Test 2: 6-Node Consensus Network (Full Verification)
This test verifies ALL span categories including consensus and peer
transaction relay, using a 6-node validator network.
### Automated
Run the integration test script:
```bash
bash docker/telemetry/integration-test.sh
```
The script will:
1. Start the observability stack
2. Generate 6 validator key pairs
3. Create config files for each node
4. Start all 6 nodes
5. Wait for consensus ("proposing" state)
6. Exercise RPC, submit transactions
7. Verify all span categories in Tempo
8. Verify spanmetrics in Prometheus
9. Print results and leave the stack running
### Manual
If you prefer to run the steps manually:
#### Step 1: Start observability stack
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
#### Step 2: Generate validator keys
Start a temporary standalone xrpld:
```bash
.build/xrpld --conf docker/telemetry/xrpld-telemetry.cfg -a --start &
TEMP_PID=$!
sleep 5
```
Generate 6 key pairs:
```bash
for i in $(seq 1 6); do
curl -s http://localhost:5005 \
-d '{"method":"validation_create"}' | jq '.result'
done
```
Record the `validation_seed` and `validation_public_key` for each.
Kill the temporary node:
```bash
kill $TEMP_PID
rm -rf data/
```
#### Step 3: Create node configs
For each node (1-6), create a config file. Template:
```ini
[server]
port_rpc
port_peer
[port_rpc]
port = {5004 + node_number}
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http
[port_peer]
port = {51234 + node_number}
ip = 0.0.0.0
protocol = peer
[node_db]
type=NuDB
path=/tmp/xrpld-integration/node{N}/nudb
online_delete=256
[database_path]
/tmp/xrpld-integration/node{N}/db
[debug_logfile]
/tmp/xrpld-integration/node{N}/debug.log
[validation_seed]
{seed from step 2}
[validators_file]
/tmp/xrpld-integration/validators.txt
[ips_fixed]
127.0.0.1 51235
127.0.0.1 51236
127.0.0.1 51237
127.0.0.1 51238
127.0.0.1 51239
127.0.0.1 51240
[peer_private]
1
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
sampling_ratio=1.0
batch_size=512
batch_delay_ms=2000
max_queue_size=2048
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=0
trace_ledger=1
[rpc_startup]
{ "command": "log_level", "severity": "warning" }
[ssl_verify]
0
```
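
The port arithmetic in the template can be sketched as a small helper. The function names `rpc_port`, `peer_port`, and `ips_fixed_stanza` are hypothetical; they only reproduce the `5004 + N` and `51234 + N` rules above and print the `[ips_fixed]` stanza shared by all nodes.

```bash
# Hypothetical helpers; the offsets come from the config template above.
rpc_port()  { echo $((5004 + $1)); }
peer_port() { echo $((51234 + $1)); }

# Print the [ips_fixed] stanza for a network of N nodes (default 6).
ips_fixed_stanza() {
    local nodes=${1:-6} i
    echo "[ips_fixed]"
    for i in $(seq 1 "$nodes"); do
        echo "127.0.0.1 $(peer_port "$i")"
    done
}

ips_fixed_stanza 6
```

Deriving both ports from the node number keeps the configs and the polling loop in later steps in agreement.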
#### Step 4: Create validators.txt
```ini
[validators]
{public_key_1}
{public_key_2}
{public_key_3}
{public_key_4}
{public_key_5}
{public_key_6}
```
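
Writing the file by hand is error-prone with six keys, so it may help to generate it. A minimal sketch, assuming the public keys recorded in Step 2 are passed as arguments; `validators_txt` is a hypothetical helper name and the keys shown are placeholders.

```bash
# Hypothetical helper: print a validators.txt body for the given public keys.
validators_txt() {
    [ "$#" -gt 0 ] || { echo "usage: validators_txt KEY..." >&2; return 1; }
    echo "[validators]"
    printf '%s\n' "$@"
}

# Placeholder keys; substitute the validation_public_key values from Step 2
# and redirect the output to /tmp/xrpld-integration/validators.txt.
validators_txt KEY_1 KEY_2 KEY_3
```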
#### Step 5: Start all 6 nodes
```bash
for i in $(seq 1 6); do
.build/xrpld --conf /tmp/xrpld-integration/node$i/xrpld.cfg --start &
echo $! >/tmp/xrpld-integration/node$i/xrpld.pid
done
```
#### Step 6: Wait for consensus
Poll each node until `server_state` = `"proposing"`:
```bash
for port in 5005 5006 5007 5008 5009 5010; do
  while true; do
    state=$(curl -s http://localhost:$port \
      -d '{"method":"server_info"}' |
      jq -r '.result.info.server_state')
    echo "Port $port: $state"
    [ "$state" = "proposing" ] && break
    sleep 5
  done
done
```
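
A polling loop with no deadline hangs forever if a node never reaches consensus. The sketch below adds a deadline; `wait_for_state` and `node_state` are hypothetical helper names, and the 120-second value in the usage comment is the timeout mentioned in the task list, not something read from the script.

```bash
# Hypothetical helper: run a state-reporting command repeatedly until it prints
# the wanted state. Returns 0 on success, 1 once the timeout (seconds) expires.
wait_for_state() {
    local want=$1 timeout=$2 interval=$3
    shift 3
    local deadline=$((SECONDS + timeout)) state
    while :; do
        state=$("$@" 2>/dev/null)
        [ "$state" = "$want" ] && return 0
        [ "$SECONDS" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}

# Read server_state from the node listening on the given RPC port.
node_state() {
    curl -s "http://localhost:$1" -d '{"method":"server_info"}' |
        jq -r '.result.info.server_state'
}

# Usage: wait_for_state proposing 120 5 node_state 5005 || echo "node1 timed out"
```

Passing the state command as arguments keeps the waiting logic independent of curl and jq, so it can be exercised without a running node.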
#### Step 7: Exercise RPC and submit transaction
```bash
# RPC calls
curl -s http://localhost:5005 -d '{"method":"server_info"}'
curl -s http://localhost:5005 -d '{"method":"server_state"}'
curl -s http://localhost:5005 -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'
# Submit transaction
curl -s http://localhost:5005 -d '{
"method": "submit",
"params": [{
"secret": "snoPBrXtMeMyMHUVTgbuqAfg1SUTb",
"tx_json": {
"TransactionType": "Payment",
"Account": "rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh",
"Destination": "rPMh7Pi9ct699iZUTWzJaUMR1o42VEfGqF",
"Amount": "10000000"
}
}]
}'
```
Wait 15 seconds for consensus and batch export.
#### Step 8: Verify in Tempo
See the "Verification Queries" section below.
---
## Expected Span Catalog
All 18 production span names instrumented across Phases 2-5:
| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --------------------------- | ----------------- | ----- | ---------------------------------------------------------------------------------------- | ------------------------- |
| `rpc.http_request` | ServerHandler.cpp | 2 | -- | Any HTTP RPC call |
| `rpc.ws_upgrade` | ServerHandler.cpp | 2 | -- | WebSocket upgrade |
| `rpc.ws_message` | ServerHandler.cpp | 2 | -- | WebSocket RPC message |
| `rpc.process` | ServerHandler.cpp | 2 | -- | RPC processing |
| `rpc.command.<name>` | RPCHandler.cpp | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
| `tx.process` | NetworkOPs.cpp | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
| `tx.receive` | PeerImp.cpp | 3 | `xrpl.peer.id` | Peer relays transaction |
| `consensus.proposal.send` | RCLConsensus.cpp | 4 | `xrpl.consensus.round` | Consensus proposing phase |
| `consensus.ledger_close` | RCLConsensus.cpp | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
| `consensus.validation.send` | RCLConsensus.cpp | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
| `consensus.accept.apply` | RCLConsensus.cpp | 4 | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state` | Ledger apply + close time |
| `tx.apply` | BuildLedger.cpp | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
| `ledger.build` | BuildLedger.cpp | 5 | `xrpl.ledger.seq`, `xrpl.ledger.close_time`, `close_time_correct`, `close_resolution_ms` | Ledger build |
| `ledger.validate` | LedgerMaster.cpp | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
| `ledger.store` | LedgerMaster.cpp | 5 | `xrpl.ledger.seq` | Ledger stored |
| `peer.proposal.receive` | PeerImp.cpp | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
| `peer.validation.receive` | PeerImp.cpp | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |
---
## Verification Queries
### Tempo API
Base URL: `http://localhost:3200`
```bash
TEMPO="http://localhost:3200"
# List all services
curl -s "$TEMPO/api/v2/search/tag/resource.service.name/values" | jq '.tagValues[].value'
# Query traces by operation
# -G sends the URL-encoded parameters as a GET query string
for op in "rpc.http_request" "rpc.ws_upgrade" "rpc.ws_message" "rpc.process" \
  "rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
  "tx.process" "tx.receive" "tx.apply" \
  "consensus.proposal.send" "consensus.ledger_close" \
  "consensus.accept" "consensus.accept.apply" \
  "consensus.validation.send" \
  "ledger.build" "ledger.validate" "ledger.store" \
  "peer.proposal.receive" "peer.validation.receive"; do
  count=$(curl -s -G "$TEMPO/api/search" \
    --data-urlencode "q={resource.service.name=\"xrpld\" && name=\"$op\"}" \
    --data-urlencode "limit=5" |
    jq '.traces | length')
  printf "%-35s %s traces\n" "$op" "$count"
done
```
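
The per-span query can be turned into a pass/fail check. This is a sketch under stated assumptions: the helper names `traceql_for_span`, `span_trace_count`, and `require_spans` are hypothetical, and the search is issued as a GET (`curl -G`) against the same `/api/search` endpoint used above.

```bash
# Hypothetical helper: build the TraceQL filter used in the loop above.
# $1 = span name, $2 = service name (defaults to xrpld).
traceql_for_span() {
    local service=${2:-xrpld}
    printf '{resource.service.name="%s" && name="%s"}' "$service" "$1"
}

# Count matching traces (at most 5) for one span name.
span_trace_count() {
    curl -s -G "${TEMPO:-http://localhost:3200}/api/search" \
        --data-urlencode "q=$(traceql_for_span "$1")" \
        --data-urlencode "limit=5" | jq '.traces | length'
}

# Return non-zero if any of the listed spans has no traces.
require_spans() {
    local missing=0 op
    for op in "$@"; do
        [ "$(span_trace_count "$op")" -gt 0 ] 2>/dev/null ||
            { echo "MISSING: $op"; missing=1; }
    done
    return "$missing"
}

# Usage: require_spans rpc.process tx.process consensus.accept
```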
### Prometheus API
Base URL: `http://localhost:9090`
```bash
PROM="http://localhost:9090"
# Span call counts (from spanmetrics connector)
curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total" |
jq '.data.result[] | {span: .metric.span_name, count: .value[1]}'
# Latency histogram
curl -s "$PROM/api/v1/query?query=traces_span_metrics_duration_milliseconds_count" |
jq '.data.result[] | {span: .metric.span_name, count: .value[1]}'
# RPC calls by command
# Use -G with --data-urlencode so curl does not treat {...} as a URL glob
curl -s -G "$PROM/api/v1/query" \
  --data-urlencode 'query=traces_span_metrics_calls_total{span_name=~"rpc.command.*"}' |
jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```
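
The spanmetrics histogram can also answer latency questions directly. A minimal sketch, assuming the `traces_span_metrics_duration_milliseconds_bucket` series that the Grafana dashboards query; `p95_query` and `p95_for_span` are hypothetical helper names.

```bash
PROM="${PROM:-http://localhost:9090}"

# Hypothetical helper: PromQL for the p95 duration (ms) of one span name
# over a 5-minute rate window.
p95_query() {
    printf 'histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="%s"}[5m])))' "$1"
}

# Run the query. -G with --data-urlencode keeps braces, brackets, and quotes
# intact instead of letting curl interpret them as URL globs.
p95_for_span() {
    curl -s -G "$PROM/api/v1/query" \
        --data-urlencode "query=$(p95_query "$1")" |
        jq -r '.data.result[].value[1]'
}

# Usage: p95_for_span rpc.process
```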
### Grafana
Open http://localhost:3000 (anonymous admin access enabled).
Pre-configured dashboards:
- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)
Pre-configured datasources:
- **Tempo**: Trace data at `http://tempo:3200`
- **Prometheus**: Metrics at `http://prometheus:9090`
---
## Troubleshooting
### No traces in Tempo
1. Check otel-collector logs:
```bash
docker compose -f docker/telemetry/docker-compose.yml logs otel-collector
```
2. Verify xrpld telemetry config has `enabled=1` and correct endpoint
3. Check that otel-collector port 4318 is accessible:
```bash
# Any HTTP response (even 404) proves the port is open, so -f is omitted
curl -s -o /dev/null http://localhost:4318 && echo "reachable"
```
4. Decrease `batch_delay_ms` or `batch_size` in the xrpld config so spans are exported sooner, or wait longer than `batch_delay_ms` before querying
### Nodes not reaching "proposing" state
1. Check that none of the peer ports (51235-51240) is already in use by another process:
```bash
for p in 51235 51236 51237 51238 51239 51240; do
ss -tlnp | grep ":$p " && echo "port $p in use"
done
```
2. Verify `[ips_fixed]` lists all 6 peer ports
3. Verify `validators.txt` has all 6 public keys
4. Check node debug logs: `tail -50 /tmp/xrpld-integration/node1/debug.log`
5. Ensure `[peer_private]` is set to `1` (prevents reaching out to public network)
### Transaction not processing
1. Verify genesis account exists:
```bash
curl -s http://localhost:5005 \
-d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}' |
jq .result.account_data.Balance
```
2. Check submit response for error codes
3. In standalone mode, remember to call `ledger_accept` after submitting
### Spanmetrics not appearing in Prometheus
1. Verify otel-collector config has `spanmetrics` connector
2. Check that the metrics pipeline is configured:
```yaml
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```
3. Verify Prometheus can reach collector:
```bash
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'
```


@@ -7,7 +7,7 @@
# - tempo: Grafana Tempo tracing backend, queryable via Grafana Explore
# on port 3000. Recommended for production (S3/GCS storage, TraceQL).
# - grafana: dashboards on port 3000, pre-configured with Tempo
-#   datasource.
+#   and Prometheus datasources.
#
# Usage:
# docker compose -f docker/telemetry/docker-compose.yml up -d
@@ -24,9 +24,11 @@ services:
image: otel/opentelemetry-collector-contrib:0.121.0
command: ["--config=/etc/otel-collector-config.yaml"]
ports:
-  - "4317:4317" # OTLP gRPC receiver
-  - "4318:4318" # OTLP HTTP receiver (xrpld sends traces here)
-  - "13133:13133" # Health check endpoint
+  - "4317:4317" # OTLP gRPC
+  - "4318:4318" # OTLP HTTP
+  - "8125:8125/udp" # StatsD UDP (beast::insight metrics)
+  - "8889:8889" # Prometheus metrics (spanmetrics + statsd)
+  - "13133:13133" # Health check
volumes:
# Mount collector pipeline config (receivers → processors → exporters)
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
@@ -50,6 +52,17 @@ services:
networks:
- xrpld-telemetry
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
depends_on:
- otel-collector
networks:
- xrpld-telemetry
# Grafana: visualization UI with Tempo pre-configured as a datasource.
# Anonymous admin access enabled for local development convenience.
grafana:
@@ -62,8 +75,10 @@ services:
volumes:
# Auto-provision Tempo datasource and search filters on startup
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
depends_on:
- tempo
- prometheus
networks:
- xrpld-telemetry


@@ -0,0 +1,858 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Consensus Round Duration",
"description": "p95 and p50 duration of consensus accept rounds. The consensus.accept span (RCLConsensus.cpp) measures the time to process an accepted ledger including transaction application and state finalization. The span carries proposers and round_time_ms attributes. Normal range is 3-6 seconds on mainnet.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "P95 Round Duration [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "P50 Round Duration [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Proposals Sent Rate",
"description": "Rate at which this node sends consensus proposals to the network. Sourced from the consensus.proposal.send span (RCLConsensus.cpp) which fires each time the node proposes a transaction set. The span carries xrpl.consensus.round identifying the consensus round number. A healthy proposing node should show steady proposal output.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.proposal.send\"}[5m]))",
"legendFormat": "Proposals / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Proposals / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Close Duration",
"description": "p95 duration of the ledger close event. The consensus.ledger_close span (RCLConsensus.cpp) measures the time from when consensus triggers a ledger close to completion. Carries xrpl.ledger.seq and xrpl.consensus.mode attributes. Compare with Consensus Round Duration to understand how close timing relates to overall round time.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.ledger_close\"}[5m])))",
"legendFormat": "P95 Close Duration [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation Send Rate",
"description": "Rate at which this node sends ledger validations to the network. Sourced from the consensus.validation.send span (RCLConsensus.cpp). Each validation confirms the node has fully validated a ledger. The span carries xrpl.ledger.seq and proposing. Should closely track the ledger close rate when the node is healthy.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "Validations / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger Apply Duration (doAccept)",
"description": "Time spent applying the consensus result to build a new ledger. Measured by the consensus.accept.apply span in doAccept().",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.accept.apply\"}[5m])))",
"legendFormat": "P95 Apply Duration [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.accept.apply\"}[5m])))",
"legendFormat": "P50 Apply Duration [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Close Time Agreement",
"description": "Rate of close time agreement vs disagreement across consensus rounds. Based on close_time_correct attribute (true = validators agreed, false = agreed to disagree per avCT_CONSENSUS_PCT).",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (close_time_correct, exported_instance) (rate(traces_span_metrics_calls_total{span_name=\"consensus.accept.apply\", xrpl_consensus_mode=~\"$consensus_mode\", exported_instance=~\"$node\"}[$__rate_interval]))",
"legendFormat": "Close Time Correct={{close_time_correct}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Rounds / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Mode Over Time",
"description": "Breakdown of consensus ledger close events by the node's consensus mode (Proposing, Observing, Wrong Ledger, Switched Ledger). Grouped by the xrpl.consensus.mode span attribute from consensus.ledger_close. A healthy validator should be predominantly in Proposing mode. Frequent Wrong Ledger or Switched Ledger indicates sync issues.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (xrpl_consensus_mode, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", xrpl_consensus_mode=~\"$consensus_mode\", span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "{{xrpl_consensus_mode}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Accept vs Close Rate",
"description": "Compares the rate of consensus.accept (ledger accepted after consensus) vs consensus.ledger_close (ledger close initiated). These should track closely in a healthy network. A divergence means some close events are not completing the accept phase, potentially indicating consensus failures or timeouts.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.accept\"}[5m]))",
"legendFormat": "Accepts / Sec [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation vs Close Rate",
"description": "Compares the rate of consensus.validation.send vs consensus.ledger_close. Each validated ledger should produce one validation message. If validations lag behind closes, the node may be falling behind on validation or experiencing issues with the validation pipeline.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "Validations / Sec [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Accept Duration Heatmap",
"description": "Heatmap showing the distribution of consensus.accept span durations across histogram buckets over time. Each cell represents how many accept events fell into that duration bucket in a 5m window. Useful for detecting outlier consensus rounds that take abnormally long.",
"type": "heatmap",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"yAxis": {
"axisLabel": "Duration (ms)"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"consensus.accept\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
},
{
"title": "Close Time: Raw Proposals (Per Node)",
"description": "Each node's raw proposed close time (close_time_self) \u2014 the unrounded wall clock value at the moment the node closed its ledger. Compare across nodes to see clock drift.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 40
},
"fieldConfig": {
"defaults": {
"unit": "dateTimeFromNow",
"custom": {
"drawStyle": "points",
"pointSize": 6,
"showPoints": "always"
}
},
"overrides": []
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\"} | select(span.close_time_self)",
"refId": "A"
}
]
},
{
"title": "Close Time: Effective / Quantized",
"description": "The consensus-agreed close time after rounding to the current resolution bin (close_time). This is the value written to the ledger header. All nodes in agreement produce the same value.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 40
},
"fieldConfig": {
"defaults": {
"unit": "dateTimeFromNow",
"custom": {
"drawStyle": "points",
"pointSize": 6,
"showPoints": "always"
}
},
"overrides": []
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\"} | select(span.close_time)",
"refId": "A"
}
]
},
{
"title": "Close Time Vote Bins & Resolution",
"description": "Number of distinct close time vote bins (close_time_vote_bins) and the bin size / resolution in ms (close_resolution_ms). More bins = more clock disagreement. Resolution adapts: finer (10s) when validators agree, coarser (120s) when they disagree.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 48
},
"fieldConfig": {
"defaults": {
"custom": {
"drawStyle": "line",
"lineInterpolation": "stepAfter",
"pointSize": 5,
"showPoints": "auto"
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Vote Bins"
},
"properties": [
{
"id": "unit",
"value": "short"
},
{
"id": "custom.axisPlacement",
"value": "left"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Resolution"
},
"properties": [
{
"id": "unit",
"value": "ms"
},
{
"id": "custom.axisPlacement",
"value": "right"
}
]
}
]
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\"} | select(span.close_time_vote_bins)",
"refId": "A"
},
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\"} | select(span.close_resolution_ms)",
"refId": "B"
}
]
},
{
"title": "Close Time Resolution Direction",
"description": "Whether close time resolution increased (coarser bins, more disagreement), decreased (finer bins, better agreement), or stayed unchanged relative to the previous ledger. Based on resolution_direction attribute.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 48
},
"fieldConfig": {
"defaults": {
"custom": {
"drawStyle": "bars",
"fillOpacity": 40,
"pointSize": 5,
"showPoints": "auto"
}
},
"overrides": []
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\" && span.resolution_direction=~\"$resolution_direction\"} | select(span.resolution_direction)",
"refId": "A"
}
]
},
{
"title": "Close Time Bin Distribution",
"description": "Distribution of raw proposed close times across quantized bins. Shows how many nodes' proposals landed in each resolution bin per consensus round. A single dominant bin indicates good clock agreement; spread across bins indicates drift or network latency.",
"type": "barchart",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 56
},
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"fillOpacity": 60
}
},
"overrides": []
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["sum"]
},
"xTickLabelRotation": -45,
"barWidth": 0.8,
"stacking": "normal"
},
"targets": [
{
"datasource": {
"type": "tempo"
},
"queryType": "traceql",
"query": "{name=\"consensus.accept.apply\" && resource.service.instance.id=~\"$node\" && span.close_time_correct=~\"$close_time_correct\"} | select(span.close_time, span.close_time_vote_bins)",
"refId": "A"
}
]
},
{
"title": "Consensus Outcome Distribution",
"description": "Distribution of consensus.accept outcomes: yes (normal), moved_on (without full agreement), expired (timeout). Non-yes outcomes indicate network stress.",
"type": "piechart",
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 64
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
},
"tooltip": {
"mode": "multi"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (consensus_state) (increase(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.accept\", consensus_state!=\"\"}[5m]))",
"legendFormat": "{{consensus_state}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short"
},
"overrides": []
}
},
{
"title": "Consensus Failures Over Time",
"description": "Rate of non-normal consensus outcomes (moved_on + expired). Spikes indicate consensus instability.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 16,
"x": 8,
"y": 64
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.accept\", consensus_state=\"moved_on\"}[5m]))",
"legendFormat": "moved_on [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"consensus.accept\", consensus_state=\"expired\"}[5m]))",
"legendFormat": "expired [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Failures / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "consensus", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "consensus_mode",
"label": "Consensus Mode",
"description": "Filter by consensus mode (Proposing, Observing, Wrong Ledger, Switched Ledger)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}, xrpl_consensus_mode)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "close_time_correct",
"label": "Close Time Agreed",
"type": "custom",
"query": "true,false",
"current": {
"text": "All",
"value": "$__all"
},
"includeAll": true,
"allValue": ".*",
"multi": true,
"options": [
{
"text": "All",
"value": "$__all",
"selected": true
},
{
"text": "true",
"value": "true",
"selected": false
},
{
"text": "false",
"value": "false",
"selected": false
}
]
},
{
"name": "resolution_direction",
"label": "Resolution Direction",
"type": "custom",
"query": "increased,decreased,unchanged",
"current": {
"text": "All",
"value": "$__all"
},
"includeAll": true,
"allValue": ".*",
"multi": true,
"options": [
{
"text": "All",
"value": "$__all",
"selected": true
},
{
"text": "increased",
"value": "increased",
"selected": false
},
{
"text": "decreased",
"value": "decreased",
"selected": false
},
{
"text": "unchanged",
"value": "unchanged",
"selected": false
}
]
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Consensus Health",
"uid": "xrpld-consensus"
}
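
Reviewer note on the close-time panels above: the descriptions refer to rounding each proposed close time to a resolution bin and counting the distinct bins (close_time_vote_bins). The sketch below illustrates that idea only. The round-to-nearest rule and the names `round_close_time` and `count_vote_bins` are assumptions made for illustration; they are not taken from this diff or from the node's source.

```python
def round_close_time(close_time_s: int, resolution_s: int) -> int:
    """Round a close time (in seconds) to the nearest multiple of resolution_s.

    Assumption: round-to-nearest, with ties rounded up. The real node may differ.
    """
    if resolution_s <= 0:
        raise ValueError("resolution_s must be positive")
    shifted = close_time_s + resolution_s // 2
    return shifted - (shifted % resolution_s)


def count_vote_bins(proposed_close_times_s, resolution_s: int) -> int:
    """Count the distinct bins that a set of raw proposals falls into."""
    return len({round_close_time(t, resolution_s) for t in proposed_close_times_s})


if __name__ == "__main__":
    proposals = [1001, 1004, 1006, 1018]  # hypothetical raw proposals, in seconds
    for resolution in (10, 30):
        print(resolution, count_vote_bins(proposals, resolution))
```

Under these assumptions a coarser resolution merges nearby proposals into fewer bins, which is the behaviour the "Close Time Vote Bins & Resolution" description relies on.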

@@ -0,0 +1,353 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Ledger Build Rate",
"description": "Rate at which new ledgers are being built. The ledger.build span (BuildLedger.cpp) wraps the entire buildLedgerImpl() function which creates a new ledger from a parent, applies transactions, flushes SHAMap nodes, and sets the accepted state. Should match the consensus close rate (~0.25/sec on mainnet with ~4s rounds).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"ledger.build\"}[5m]))",
"legendFormat": "Builds / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger Build Duration",
"description": "p95 and p50 duration of ledger builds. Measures the full buildLedgerImpl() call including transaction application, SHAMap flushing, and ledger acceptance. The span records xrpl.ledger.seq as an attribute. Long build times indicate expensive transaction sets or I/O pressure from SHAMap flushes.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"ledger.build\"}[5m])))",
"legendFormat": "P95 Build Duration [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"ledger.build\"}[5m])))",
"legendFormat": "P50 Build Duration [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Validation Rate",
"description": "Rate at which ledgers pass the validation threshold and are accepted as fully validated. The ledger.validate span (LedgerMaster.cpp) fires in checkAccept() only after the ledger receives sufficient trusted validations (>= quorum). Records xrpl.ledger.seq and validations (the number of validations received).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"ledger.validate\"}[5m]))",
"legendFormat": "Validations / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger Build Duration Heatmap",
"description": "Heatmap showing the distribution of ledger.build durations across histogram buckets over time. Each cell represents the count of ledger builds that fell into that duration bucket in a 5m window. Useful for spotting occasional slow ledger builds that may not appear in percentile charts.",
"type": "heatmap",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"yAxis": {
"axisLabel": "Duration (ms)"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"ledger.build\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
},
"overrides": []
}
},
{
"title": "Transaction Apply Duration",
"description": "p95 and p50 duration of applying the consensus transaction set during ledger building. The tx.apply span (BuildLedger.cpp) wraps applyTransactions() which iterates through the CanonicalTXSet with multiple retry passes. Records tx_count (successful) and tx_failed (failed) as attributes.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P95 tx.apply [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P50 tx.apply [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Apply Rate",
"description": "Rate of tx.apply span invocations, reflecting how frequently the transaction application phase runs during ledger building. Each ledger build triggers one tx.apply call. Should closely match the ledger build rate.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.apply\"}[5m]))",
"legendFormat": "tx.apply / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Operations / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Store Rate",
"description": "Rate at which ledgers are stored into the ledger history. The ledger.store span (LedgerMaster.cpp) wraps storeLedger() which inserts the ledger into the LedgerHistory cache. Records xrpl.ledger.seq. Should match the ledger build rate under normal operation.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"ledger.store\"}[5m]))",
"legendFormat": "Stores / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Build vs Close Duration",
"description": "Compares p95 durations of ledger.build (the actual ledger construction in BuildLedger.cpp) vs consensus.ledger_close (the consensus close event in RCLConsensus.cpp). Build time is a subset of close time. A large gap between them indicates overhead in the consensus pipeline outside of ledger construction itself.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"ledger.build\"}[5m])))",
"legendFormat": "P95 ledger.build [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"consensus.ledger_close\"}[5m])))",
"legendFormat": "P95 consensus.ledger_close [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "ledger", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Ledger Operations",
"uid": "xrpld-ledger-ops"
}
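
Reviewer note on the duration panels above: the p50 and p95 series come from histogram_quantile() over cumulative `le` buckets, so the reported value is an interpolation inside a bucket rather than an observed duration. The sketch below is a simplified Python rendering of that interpolation, assuming linear interpolation within the bucket and a lower bound of 0 for the first bucket. The bucket boundaries and counts are made up for illustration and do not reflect the collector's configured buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical ledger.build durations in milliseconds.
    sample = [(10, 50), (50, 90), (100, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
    print(histogram_quantile(0.50, sample))
```

One consequence worth keeping in mind when reading the panels: a quantile can never exceed the highest finite bucket bound, so very slow builds are easier to spot in the heatmap than in the P95 line.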

@@ -0,0 +1,227 @@
{
"annotations": {
"list": []
},
"description": "Requires trace_peer=1 in the [telemetry] config section.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Peer Proposal Receive Rate",
"description": "Rate of consensus proposals received from network peers. The peer.proposal.receive span (PeerImp.cpp) fires in onMessage(TMProposeSet) for each incoming proposal. Records xrpl.peer.id (sending peer) and proposal_trusted (whether the proposer is in our UNL). Requires trace_peer=1 in the telemetry config.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"peer.proposal.receive\"}[5m]))",
"legendFormat": "Proposals Received / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Proposals / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Peer Validation Receive Rate",
"description": "Rate of ledger validations received from network peers. The peer.validation.receive span (PeerImp.cpp) fires in onMessage(TMValidation) for each incoming validation message. Records xrpl.peer.id (sending peer) and validation_trusted (whether the validator is trusted). Requires trace_peer=1 in the telemetry config.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"peer.validation.receive\"}[5m]))",
"legendFormat": "Validations Received / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Validations / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Proposals Trusted vs Untrusted",
"description": "Pie chart showing the ratio of proposals received from trusted validators (in our UNL) vs untrusted validators. Grouped by the proposal_trusted span attribute (true/false). A healthy node connected to a well-configured UNL should see a significant portion of trusted proposals. Note: proposals that fail early validation may not have the trusted attribute set.",
"type": "piechart",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (proposal_trusted, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", proposal_trusted=~\"$proposal_trusted\", span_name=\"peer.proposal.receive\"}[5m]))",
"legendFormat": "Trusted = {{proposal_trusted}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Validations Trusted vs Untrusted",
"description": "Pie chart showing the ratio of validations received from trusted validators (in our UNL) vs untrusted validators. Grouped by the validation_trusted span attribute (true/false). Monitoring this helps detect if the node is receiving validations from the expected set of trusted validators. Note: validations that fail early checks may not have the trusted attribute set.",
"type": "piechart",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (validation_trusted, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", validation_trusted=~\"$validation_trusted\", span_name=\"peer.validation.receive\"}[5m]))",
"legendFormat": "Trusted = {{validation_trusted}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "peer", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "proposal_trusted",
"label": "Proposal Trusted",
"description": "Filter by proposal trust status (true = from trusted validator)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total{span_name=\"peer.proposal.receive\"}, proposal_trusted)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "validation_trusted",
"label": "Validation Trusted",
"description": "Filter by validation trust status (true = from trusted validator)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total{span_name=\"peer.validation.receive\"}, validation_trusted)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Peer Network",
"uid": "xrpld-peer-net"
}
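
Reviewer note: each dashboard in this change filters its queries with template variables such as $node, $proposal_trusted and $validation_trusted, so a query that references a variable missing from the templating list will silently return no data. The sketch below is one possible pre-merge check, assuming the dashboard JSON has the same shape as the files in this diff (panels[].targets[].expr or .query, and templating.list[].name). The function name `undefined_variables` is hypothetical and not part of Grafana.

```python
import re

# Matches $name or ${name...}. Grafana built-ins such as $__all start with "__".
VAR_RE = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)")


def undefined_variables(dashboard: dict) -> list:
    """Return variable names used in panel queries but not defined in templating."""
    defined = {v["name"] for v in dashboard.get("templating", {}).get("list", [])}
    used = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            text = target.get("expr", "") + " " + target.get("query", "")
            for name in VAR_RE.findall(text):
                if not name.startswith("__"):
                    used.add(name)
    return sorted(used - defined)


if __name__ == "__main__":
    # Minimal inline example; a real check would json.load() each dashboard file.
    example = {
        "panels": [
            {
                "targets": [
                    {
                        "expr": 'rate(calls{exported_instance=~"$node", '
                        'proposal_trusted=~"$proposal_trusted"}[5m])'
                    }
                ]
            }
        ],
        "templating": {"list": [{"name": "node"}]},
    }
    print(undefined_variables(example))
```

The check only looks at query strings, so variables used in panel titles or links would need extra handling.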

@@ -0,0 +1,466 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "RPC Request Rate by Command",
"description": "Per-second rate of RPC command executions, broken down by command name (e.g. server_info, submit). Calculated as rate(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}) over a 5m window, grouped by the command span attribute.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (command, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\"}[5m]))",
"legendFormat": "{{command}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"axisLabel": "Requests / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Latency P95 by Command",
"description": "95th percentile response time for each RPC command. Computed from the spanmetrics duration histogram using histogram_quantile(0.95) over rpc.command.* spans, grouped by command. High values indicate slow commands that may need optimization.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, command, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\"}[5m])))",
"legendFormat": "P95 {{command}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Error Rate",
"description": "Percentage of RPC commands that completed with an error status, per command. Calculated as (error calls / total calls) * 100, where errors have status_code=STATUS_CODE_ERROR. Thresholds: green < 1%, yellow >= 1% and < 5%, red >= 5%.",
"type": "bargauge",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (command, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\", status_code=\"STATUS_CODE_ERROR\"}[5m])) / sum by (command, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\"}[5m])) * 100",
"legendFormat": "{{command}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "red",
"value": 5
}
]
}
},
"overrides": []
}
},
{
"title": "RPC Latency Heatmap",
"description": "Distribution of RPC command response times across histogram buckets. Shows the density of requests at each latency level over time. Each cell represents the count of requests that fell into that duration bucket in a 5m window. Useful for spotting bimodal latency patterns.",
"type": "heatmap",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"yAxis": {
"axisLabel": "Duration (ms)"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
},
{
"title": "Overall RPC Throughput",
"description": "Aggregate RPC throughput showing two layers of the request pipeline. rpc.http_request is the outer HTTP handler (ServerHandler.cpp) that accepts incoming connections. rpc.process is the inner processing layer (ServerHandler.cpp) that parses and dispatches. A gap between the two indicates requests being queued or rejected before processing.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=\"rpc.http_request\"}[5m]))",
"legendFormat": "rpc.http_request / Sec [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=\"rpc.process\"}[5m]))",
"legendFormat": "rpc.process / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"axisLabel": "Requests / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Success vs Error",
"description": "Aggregate rate of successful vs failed RPC commands across all command types. Success = status_code STATUS_CODE_UNSET (the OpenTelemetry default when a span does not set an explicit status). Error = status_code STATUS_CODE_ERROR. A sustained error rate warrants investigation via the per-command RPC Error Rate panel above.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\", status_code=\"STATUS_CODE_UNSET\"}[5m]))",
"legendFormat": "Success [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\", status_code=\"STATUS_CODE_ERROR\"}[5m]))",
"legendFormat": "Error [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Commands / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Top Commands by Volume",
"description": "Top 10 command and node combinations by total invocation count over the last 5 minutes. Uses topk(10, sum by (command, exported_instance) (increase(traces_span_metrics_calls_total[5m]))) to rank them. Helps identify the hottest API endpoints driving load on each node.",
"type": "bargauge",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, sum by (command, exported_instance) (increase(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=~\"rpc.command.*\"}[5m])))",
"legendFormat": "{{command}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "none"
},
"overrides": []
}
},
{
"title": "WebSocket Message Rate",
"description": "Rate of incoming WebSocket RPC messages processed by the server. Sourced from the rpc.ws_message span (ServerHandler.cpp). Only active when clients connect via WebSocket instead of HTTP. Zero is normal if only HTTP RPC is in use.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", command=~\"$command\", span_name=\"rpc.ws_message\"}[5m]))",
"legendFormat": "WS Messages / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "RPC Resource Cost by Command",
"description": "Rate of rpc.command.* spans grouped by load_type (resource cost category) and summed across the selected nodes. A sustained rate in high-cost categories such as exception_rpc or malformed_rpc may indicate problematic clients.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (load_type) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=~\"rpc.command.*\", load_type!=\"\"}[5m]))",
"legendFormat": "{{load_type}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Requests / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Batch vs Single RPC Requests",
"description": "Rate of batch RPC requests vs single requests. High batch rate may indicate bulk automation clients.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"rpc.process\", is_batch=\"true\"}[5m]))",
"legendFormat": "Batch [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"rpc.process\", is_batch=\"false\"}[5m]))",
"legendFormat": "Single [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Requests / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "rpc", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "command",
"label": "RPC Command",
"description": "Filter by RPC command name (e.g., server_info, submit)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}, command)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "RPC Performance",
"uid": "xrpld-rpc-perf"
}

@@ -0,0 +1,527 @@
{
"annotations": {
"list": []
},
"description": "Ledger data exchange and object fetch traffic from beast::insight StatsD. Covers ledger sync, node data retrieval, and transaction set exchange. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Ledger Data Exchange (Bytes In)",
"description": "Inbound bytes for ledger data sub-categories. 'ledger_data' = aggregated ledger data, sub-types include Transaction_Set_candidate (proposed tx sets), Transaction_Node (tx tree nodes), and Account_State_Node (state tree nodes). High Account_State_Node traffic indicates state sync; high Transaction_Set_candidate indicates consensus catch-up. Sourced from TrafficCount.h ledger_data_* categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Data Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Data Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Account_State_Node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Node Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Account_State_Node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Node Share"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes In",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Share/Get Traffic (Bytes)",
"description": "Legacy ledger share and get traffic by sub-type. These are the older ledger fetch protocol categories (as opposed to ledger_data_* which is the newer protocol). Sub-types: Transaction_Set_candidate, Transaction_node, Account_State_node, plus aggregate ledger_share and ledger_get. Sourced from TrafficCount.h ledger_* categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Share In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes In",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "GetObject Traffic by Type (Bytes In)",
"description": "Object fetch traffic by object type. GetObject is the protocol for fetching specific SHAMap nodes. Types shown here: Ledger (full ledger headers), Transaction (individual txs), Transaction_node (tx tree nodes), and Account_State_node (state tree nodes). The CAS, Fetch_Pack, and Transactions categories are plotted in the GetObject Aggregate & Special Types panel. Sourced from TrafficCount.h getobject_* categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Share"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes In",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "GetObject Aggregate & Special Types (Bytes In)",
"description": "Aggregate getobject traffic plus special categories: CAS (Content Addressable Storage) for SHAMap node fetch, Fetch_Pack for bulk batch downloads during catch-up, Transactions for bulk tx fetch, and the aggregate getobject_get/getobject_share totals. Sourced from TrafficCount.h getobject_* categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Share"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transactions_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transactions Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Aggregate Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Aggregate Share"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes In",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "GetObject Messages by Type",
"description": "Message counts for object fetch operations. Shows how many individual fetch requests and responses are exchanged per type. High message counts with low byte counts indicate small object fetches; the inverse indicates large batch transfers. Sourced from TrafficCount.h getobject_* categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Get"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transactions_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Transactions Get"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages In",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Overlay Traffic Heatmap (All Categories, Bytes In)",
"description": "Bar gauge showing the top 20 overlay traffic categories ranked by inbound bytes, excluding the total aggregate. Provides an at-a-glance view of which protocol message types consume the most bandwidth out of the 57+ traffic categories. Sourced from all TrafficCount.h categories via wildcard match.",
"type": "bargauge",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"displayMode": "gradient",
"orientation": "horizontal",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(20, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\", exported_instance=~\"$node\"})",
"legendFormat": "{{__name__}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1048576
},
{
"color": "red",
"value": 104857600
}
]
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "ledger", "sync", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by xrpld node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Ledger Data & Sync (StatsD)",
"uid": "xrpld-statsd-ledger-sync"
}
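
The panel queries in these dashboards are PromQL strings embedded in JSON, so a JSON parser will not catch a malformed expression. Below is a minimal sketch, assuming the dashboard layout used in this diff (`panels[].targets[].expr`), of a check that flags label matchers attached to a function or aggregation name (for example `rate{...}(` or `topk{...}(`) instead of to a metric selector. The function name `find_suspect_exprs` and the list of operator names are illustrative assumptions; this is a regex heuristic, not a PromQL parser.

```python
import json
import re

# Heuristic: a PromQL function/aggregation name immediately followed by "{".
# PromQL expects matchers on a metric selector, e.g. rate(metric{...}[5m]).
# The operator list below is an illustrative subset, not exhaustive.
MISPLACED_MATCHER = re.compile(
    r"\b(rate|irate|increase|topk|bottomk|sum|avg|min|max|count)\s*\{"
)


def find_suspect_exprs(dashboard: dict) -> list[tuple[str, str]]:
    """Return (panel title, expr) pairs whose expr matches the heuristic."""
    suspects = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if MISPLACED_MATCHER.search(expr):
                suspects.append((panel.get("title", ""), expr))
    return suspects


if __name__ == "__main__":
    # Inline sample standing in for a dashboard file loaded with json.load().
    sample = json.loads(json.dumps({
        "panels": [{
            "title": "Total Network Bytes",
            "targets": [
                {"expr": 'rate{exported_instance=~"$node"}(rippled_total_Bytes_In[5m])'},
                {"expr": 'rate(rippled_total_Bytes_In{exported_instance=~"$node"}[5m])'},
            ],
        }]
    }))
    for title, expr in find_suspect_exprs(sample):
        print(f"{title}: {expr}")
```

Running a check like this over each dashboard file before merge would surface only the first expression in the sample; metric names that merely end in an operator name (such as `foo_sum{...}`) are not flagged because the pattern requires a word boundary.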

@@ -0,0 +1,805 @@
{
"annotations": {
"list": []
},
"description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Active Peers",
"description": "Number of active inbound and outbound peer connections. Sourced from Peer_Finder.Active_Inbound_Peers and Peer_Finder.Active_Outbound_Peers gauges (PeerfinderManager.cpp). A healthy mainnet node typically has 10-21 outbound and 0-85 inbound peers depending on configuration.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Peer_Finder_Active_Inbound_Peers{exported_instance=~\"$node\"}",
"legendFormat": "Inbound Peers"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Peer_Finder_Active_Outbound_Peers{exported_instance=~\"$node\"}",
"legendFormat": "Outbound Peers"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Peers",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Peer Disconnects",
"description": "Cumulative count of peer disconnections. Sourced from the Overlay.Peer_Disconnects gauge (OverlayImpl.h). A rising trend indicates network instability, aggressive peer management, or resource exhaustion causing connection drops.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Overlay_Peer_Disconnects{exported_instance=~\"$node\"}",
"legendFormat": "Disconnects"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Disconnects",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Total Network Bytes",
"description": "Rate of total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h). Wrapped in rate() to show throughput rather than cumulative counter values.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_total_Bytes_In{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_total_Bytes_Out{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"axisLabel": "Throughput",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Total Network Messages",
"description": "Total messages sent and received across all peer connections. Sourced from the total.Messages_In and total.Messages_Out traffic category gauges (OverlayImpl.h). Shows the overall message throughput of the overlay network.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Messages In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Messages Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Traffic",
"description": "Message counts for transaction-related overlay traffic, including duplicate transactions received. Covers the transactions and transactions_duplicate traffic categories (OverlayImpl/TrafficCount.h). Spikes indicate high transaction volume on the network or transaction flooding.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Messages In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "TX Messages Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_duplicate_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Proposal Traffic",
"description": "Messages for consensus proposal overlay traffic. Includes proposals, proposals_untrusted, and proposals_duplicate categories (TrafficCount.h). High untrusted or duplicate counts may indicate UNL misconfiguration or network spam.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proposals_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Proposals In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proposals_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Proposals Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proposals_untrusted_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Untrusted In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proposals_duplicate_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation Traffic",
"description": "Messages for validation overlay traffic. Includes validations, validations_untrusted, and validations_duplicate categories (TrafficCount.h). Monitoring trusted vs untrusted validation traffic helps detect UNL health issues.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Validations In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Validations Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_untrusted_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Untrusted In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_duplicate_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Overlay Traffic by Category (Bytes In)",
"description": "Top 10 traffic categories by inbound bytes, drawn from all 57 overlay traffic categories in TrafficCount.h and excluding the total aggregate. Shows which protocol message types consume the most bandwidth. Categories include transactions, proposals, validations, ledger data, getobject, and overlay overhead.",
"type": "bargauge",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\", exported_instance=~\"$node\"})",
"legendFormat": "{{__name__}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "rippled_transactions_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Transactions"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_proposals_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Proposals"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_validations_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Validations"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_overhead_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Overhead"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_overhead_overlay_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Overhead Overlay"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ping_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Ping"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_status_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Status"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_getObject_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Get Object"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_haveTxSet_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Have Tx Set"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledgerData_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Ledger Data"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Ledger Share"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_get_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Ledger Data Get"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Ledger Data Share"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Account_State_Node_get_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Account State Node Get"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Account_State_Node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Account State Node Share"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Node_get_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Transaction Node Get"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Transaction Node Share"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Tx Set Candidate Get"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Account_State_node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Account State Node Share (Legacy)"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Tx Set Candidate Share"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Transaction_node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Transaction Node Share (Legacy)"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_set_get_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Set Get"
}
]
}
]
}
},
{
"title": "Duplicate Traffic (Wasted Bandwidth)",
"description": "Rate of duplicate overlay traffic across transaction, proposal, and validation categories. Duplicate messages are messages the node has already seen and discards. High duplicate rates indicate inefficient message routing or network topology issues causing redundant relays.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_transactions_duplicate_Bytes_In{exported_instance=~\"$node\"}[5m])",
"legendFormat": "TX Duplicate In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_transactions_duplicate_Bytes_Out{exported_instance=~\"$node\"}[5m])",
"legendFormat": "TX Duplicate Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_proposals_duplicate_Bytes_In{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Proposals Duplicate In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_proposals_duplicate_Bytes_Out{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Proposals Duplicate Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_validations_duplicate_Bytes_In{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Validations Duplicate In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_validations_duplicate_Bytes_Out{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Validations Duplicate Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"axisLabel": "Throughput",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "All Traffic Categories (Detail)",
"description": "Top 15 traffic categories by inbound byte rate, excluding the total aggregate. Provides a detailed timeseries view of which overlay message types are consuming the most bandwidth over time. Complements the bar gauge snapshot view in the Overlay Traffic panel.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(15, rate({__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\", exported_instance=~\"$node\"}[5m]))",
"legendFormat": "{{__name__}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"custom": {
"axisLabel": "Throughput",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "network", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by xrpld node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Network Traffic (StatsD)",
"uid": "xrpld-statsd-network"
}

@@ -0,0 +1,950 @@
{
"annotations": {
"list": []
},
"description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Validated Ledger Age",
"description": "Age of the most recently validated ledger in seconds. Sourced from the LedgerMaster.Validated_Ledger_Age gauge (LedgerMaster.h) which is updated every collection interval via the insight hook. Values above 20s indicate the node is falling behind the network.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Validated_Ledger_Age{exported_instance=~\"$node\"}",
"legendFormat": "Validated Age"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 10
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
}
},
{
"title": "Published Ledger Age",
"description": "Age of the most recently published ledger in seconds. Sourced from the LedgerMaster.Published_Ledger_Age gauge (LedgerMaster.h). Published ledger age should track close to validated ledger age. A growing gap indicates publish pipeline backlog.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Published_Ledger_Age{exported_instance=~\"$node\"}",
"legendFormat": "Published Age"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 10
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
}
},
{
"title": "Operating Mode Duration",
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp). A healthy node should spend the vast majority of time in Full mode.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_duration{exported_instance=~\"$node\"}",
"legendFormat": "Full"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_duration{exported_instance=~\"$node\"}",
"legendFormat": "Tracking"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_duration{exported_instance=~\"$node\"}",
"legendFormat": "Syncing"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_duration{exported_instance=~\"$node\"}",
"legendFormat": "Connected"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_duration{exported_instance=~\"$node\"}",
"legendFormat": "Disconnected"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"axisLabel": "Duration (Sec)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Operating Mode Transitions",
"description": "Count of transitions into each operating mode. Sourced from State_Accounting.*_transitions gauges (NetworkOPs.cpp). Frequent transitions out of Full mode indicate instability. Transitions to Disconnected or Syncing warrant investigation.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Full"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Tracking"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Syncing"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Connected"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Disconnected"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Transitions",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "I/O Latency",
"description": "P95 and P50 of the I/O service loop latency in milliseconds. Sourced from the ios_latency event (Application.cpp) which measures how long it takes for the io_context to process a timer callback. Values above 10ms are logged; above 500ms trigger warnings. High values indicate thread pool saturation or blocking operations.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95 I/O Latency"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50 I/O Latency"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Job Queue Depth",
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_job_count{exported_instance=~\"$node\"}",
"legendFormat": "Job Queue Depth"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Jobs",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Fetch Rate",
"description": "Rate of ledger fetch requests initiated by the node. Sourced from the ledger_fetches counter (InboundLedgers.cpp) which increments each time the node requests a ledger from a peer. High rates indicate the node is catching up or missing ledgers.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate{exported_instance=~\"$node\"}(rippled_ledger_fetches_total[5m])",
"legendFormat": "Fetches / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger History Mismatches",
"description": "Rate of ledger history hash mismatches. Sourced from the ledger.history.mismatch counter (LedgerHistory.cpp) which increments when a built ledger hash does not match the expected validated hash. Non-zero values indicate consensus divergence or database corruption.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate{exported_instance=~\"$node\"}(rippled_ledger_history_mismatch_total[5m])",
"legendFormat": "Mismatches / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 0.01
}
]
}
},
"overrides": []
}
},
{
"title": "Key Jobs Execution Time",
"description": "Execution time for critical job types at the selected quantile. Sourced from per-job-type events in JobTypeData (JobTypeData.h). Shows how long key consensus, transaction, and maintenance jobs take to execute. Spikes indicate processing bottlenecks.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_acceptLedger{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Accept Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_advanceLedger{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Advance Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transaction{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Transaction [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_writeObjects{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Write Objects [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_heartbeat{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Heartbeat [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_sweep{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Sweep [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_trustedValidation{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Trusted Validation [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_trustedProposal{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Trusted Proposal [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_publishNewLedger{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Publish New Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_clientRPC{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Client RPC [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledgerData{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Ledger Data [{{quantile}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Key Jobs Dequeue Wait Time",
"description": "Time spent waiting in the job queue before execution for critical job types. Sourced from per-job-type dequeue events (JobTypeData.h). High dequeue times indicate the job queue is backlogged and jobs are waiting too long to be scheduled.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_acceptLedger_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Accept Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_advanceLedger_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Advance Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transaction_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Transaction [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_writeObjects_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Write Objects [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_heartbeat_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Heartbeat [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_sweep_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Sweep [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_trustedValidation_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Trusted Validation [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_trustedProposal_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Trusted Proposal [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_publishNewLedger_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Publish New Ledger [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_clientRPC_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Client RPC [{{quantile}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledgerData_q{exported_instance=~\"$node\", quantile=\"$quantile\"}",
"legendFormat": "Ledger Data [{{quantile}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Wait Time (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "FullBelowCache Size",
"description": "Number of entries in the FullBelowCache. Sourced from the TaggedCache size gauge (TaggedCache.h) for the Node family full below cache (NodeFamily.cpp). This cache tracks which SHAMap nodes have all children present locally, avoiding redundant fetches during ledger acquisition.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 40
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Node_family_full_below_cache_size{exported_instance=~\"$node\"}",
"legendFormat": "FullBelowCache Size"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Entries",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "FullBelowCache Hit Rate",
"description": "Hit rate percentage for the FullBelowCache. Sourced from the TaggedCache hit_rate gauge (TaggedCache.h). A high hit rate means the node is efficiently reusing cached knowledge about complete SHAMap subtrees. Low hit rates during steady state warrant investigation.",
"type": "gauge",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 40
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Node_family_full_below_cache_hit_rate{exported_instance=~\"$node\"}",
"legendFormat": "Hit Rate"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{
"color": "red",
"value": null
},
{
"color": "yellow",
"value": 25
},
{
"color": "green",
"value": 50
}
]
}
},
"overrides": []
}
},
{
"title": "Ledger Publish Gap",
"description": "Difference between published and validated ledger ages. Computed as Published_Ledger_Age minus Validated_Ledger_Age. A value near zero means the publish pipeline keeps up with validation. A growing gap indicates the publish pipeline is falling behind, potentially causing stale data for subscribers.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 48
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Published_Ledger_Age{exported_instance=~\"$node\"} - rippled_LedgerMaster_Validated_Ledger_Age",
"legendFormat": "Publish Gap"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 5
},
{
"color": "red",
"value": 10
}
]
}
},
"overrides": []
}
},
{
"title": "State Duration Rate (Full vs Tracking)",
"description": "Rate of change of time spent in Full and Tracking operating modes, normalized to seconds. Sourced from State_Accounting duration gauges (NetworkOPs.cpp). In steady state the Full duration rate should be close to 1.0 (gaining one second of Full-mode time per wall-clock second). A drop below 1.0 means the node is spending time in other modes.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 48
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate{exported_instance=~\"$node\"}(rippled_State_Accounting_Full_duration[5m]) / 1000000",
"legendFormat": "Full Mode Rate"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rate{exported_instance=~\"$node\"}(rippled_State_Accounting_Tracking_duration[5m]) / 1000000",
"legendFormat": "Tracking Mode Rate"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Rate (s/s)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "All Jobs Execution Time (Detail)",
"description": "Execution time for ALL non-special job types at the selected quantile. Shows the complete picture of job execution performance. Use the Key Jobs panel for a focused view of the most critical jobs.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 56
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "{__name__{exported_instance=~\"$node\"}=~\"rippled_(makeFetchPack|publishAcqLedger|untrustedValidation|manifest|localTransaction|ledgerReplayRequest|ledgerRequest|untrustedProposal|ledgerReplayTask|ledgerData|clientCommand|clientSubscribe|clientFeeChange|clientConsensus|clientAccountHistory|clientRPC|clientWebsocket|RPC|updatePaths|transaction|batch|advanceLedger|publishNewLedger|fetchTxnData|writeAhead|trustedValidation|writeObjects|acceptLedger|trustedProposal|sweep|clusterReport|heartbeat|administration|handleHaveTransactions|doTransactions)\", quantile=\"$quantile\"}",
"legendFormat": "{{__name__}} [{{quantile}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "All Jobs Dequeue Wait (Detail)",
"description": "Dequeue wait time for ALL non-special job types at the selected quantile. Shows the complete picture of job queue waiting times. High wait times across many job types indicate systemic job queue congestion.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 64
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "{__name__{exported_instance=~\"$node\"}=~\"rippled_(makeFetchPack_q|publishAcqLedger_q|untrustedValidation_q|manifest_q|localTransaction_q|ledgerReplayRequest_q|ledgerRequest_q|untrustedProposal_q|ledgerReplayTask_q|ledgerData_q|clientCommand_q|clientSubscribe_q|clientFeeChange_q|clientConsensus_q|clientAccountHistory_q|clientRPC_q|clientWebsocket_q|RPC_q|updatePaths_q|transaction_q|batch_q|advanceLedger_q|publishNewLedger_q|fetchTxnData_q|writeAhead_q|trustedValidation_q|writeObjects_q|acceptLedger_q|trustedProposal_q|sweep_q|clusterReport_q|heartbeat_q|administration_q|handleHaveTransactions_q|doTransactions_q)\", quantile=\"$quantile\"}",
"legendFormat": "{{__name__}} [{{quantile}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Wait Time (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "node-health", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by xrpld node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "quantile",
"label": "Quantile",
"type": "custom",
"query": "0.5,0.9,0.95,0.99",
"current": {
"selected": true,
"text": "0.95",
"value": "0.95"
},
"options": [
{
"selected": false,
"text": "0.5",
"value": "0.5"
},
{
"selected": false,
"text": "0.9",
"value": "0.9"
},
{
"selected": true,
"text": "0.95",
"value": "0.95"
},
{
"selected": false,
"text": "0.99",
"value": "0.99"
}
],
"multi": false,
"includeAll": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Node Health (StatsD)",
"uid": "xrpld-statsd-node-health"
}

@@ -0,0 +1,587 @@
{
"annotations": {
"list": []
},
"description": "Detailed overlay traffic breakdown for categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Squelch Traffic (Messages)",
"description": "Squelch-related overlay messages. Squelch is the peer traffic management protocol that suppresses redundant message forwarding. 'squelch' = squelch control messages, 'squelch_suppressed' = messages suppressed by squelch, 'squelch_ignored' = squelch directives that were ignored. High suppressed counts indicate effective bandwidth savings; high ignored counts may indicate misconfigured peers. Sourced from TrafficCount.h squelch categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Squelch In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Squelch Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_suppressed_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Suppressed In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_suppressed_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Suppressed Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_ignored_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Ignored In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_ignored_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Ignored Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Overhead Traffic Breakdown (Bytes)",
"description": "Overlay protocol overhead by sub-category. 'overhead' = base protocol overhead (ping, status, etc.), 'overhead_cluster' = intra-cluster communication overhead, 'overhead_manifest' = validator manifest distribution overhead. High cluster overhead may indicate frequent cluster state syncs; high manifest overhead occurs during UNL changes. Sourced from TrafficCount.h overhead categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Base Overhead In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Base Overhead Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_cluster_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Cluster In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_cluster_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Cluster Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_manifest_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Manifest In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_manifest_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Manifest Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validator List Traffic",
"description": "Validator list (UNL) distribution traffic. Validator lists are exchanged when peers share their trusted validator configurations. Spikes occur during UNL updates or when new peers connect. Sourced from TrafficCount.h validator_lists category.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Bytes Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Messages In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Messages Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Count",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": [
{
"matcher": {
"id": "byRegexp",
"options": "/Bytes/"
},
"properties": [
{
"id": "custom.axisPlacement",
"value": "right"
},
{
"id": "unit",
"value": "decbytes"
}
]
}
]
}
},
{
"title": "Set Get/Share Traffic (Bytes)",
"description": "Transaction set get and share traffic. 'set_get' = requests to fetch transaction sets (sent during ledger close), 'set_share' = responses sharing transaction sets. High set_get traffic indicates peers frequently requesting missing transaction sets, which may signal sync delays. Sourced from TrafficCount.h set_get/set_share categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Set Get In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_get_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Set Get Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Set Share In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_share_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Set Share Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Have/Requested Transactions (Messages)",
"description": "Transaction availability protocol messages. 'have_transactions' = advertisements that a peer has specific transactions available, 'requested_transactions' = explicit requests for transaction data. A high ratio of requested to have may indicate peers are behind on transaction propagation. Sourced from TrafficCount.h have_transactions/requested_transactions categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_have_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Have TX In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_have_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Have TX Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_requested_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Requested TX In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_requested_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Requested TX Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Unknown / Unclassified Traffic",
"description": "Traffic that does not match any known overlay message category. Non-zero values may indicate protocol version mismatches, corrupted messages, or new message types not yet classified. Sourced from TrafficCount.h unknown category.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Bytes Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Messages In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Messages Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Count",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": [
{
"matcher": {
"id": "byRegexp",
"options": "/Bytes/"
},
"properties": [
{
"id": "custom.axisPlacement",
"value": "right"
},
{
"id": "unit",
"value": "decbytes"
}
]
}
]
}
},
{
"title": "Proof Path Traffic",
"description": "Proof path request/response traffic for ledger state proof exchange. Used by peers to verify specific ledger entries without downloading the full ledger. High request volume may indicate peers validating state during catch-up. Sourced from TrafficCount.h proof_path_request/proof_path_response categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_request_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_request_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_response_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_response_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Replay Delta Traffic",
"description": "Replay delta request/response traffic for ledger replay protocol. Used during catch-up to efficiently replay ledger state changes. Sourced from TrafficCount.h replay_delta_request/replay_delta_response categories.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_request_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_request_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes Out"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_response_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes In"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_response_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "overlay", "network", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by xrpld node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Overlay Traffic Detail (StatsD)",
"uid": "xrpld-statsd-overlay-detail"
}

@@ -0,0 +1,417 @@
{
"annotations": {
"list": []
},
"description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "RPC Request Rate (StatsD)",
"description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate{exported_instance=~\"$node\"}(rippled_rpc_requests_total[5m])",
"legendFormat": "Requests / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
},
"overrides": []
}
},
{
"title": "RPC Response Time (StatsD)",
"description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. Compare with span-based rpc.request duration.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95 Response Time"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50 Response Time"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Response Size",
"description": "P95 and P50 of RPC response payload size in bytes. Sourced from the rpc.size event (ServerHandler.cpp) which records the byte length of each RPC JSON response. Large responses may indicate expensive queries (e.g. account_tx with many results) or API misuse.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95 Response Size"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50 Response Size"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Size (Bytes)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Response Time Distribution",
"description": "Distribution of RPC response times from the beast::insight timer showing P50, P90, P95, and P99 quantiles. Sourced from the rpc.time event (ServerHandler.cpp). Useful for detecting bimodal latency or long-tail requests.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.9\"}",
"legendFormat": "P90"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{exported_instance=~\"$node\", quantile=\"0.99\"}",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Pathfinding Fast Duration",
"description": "P95 and P50 of fast pathfinding execution time. Sourced from the pathfind_fast event (PathRequests.h) which records the duration of the fast pathfinding algorithm. Fast pathfinding uses a simplified search that trades accuracy for speed.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_fast{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95 Fast Pathfind"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_fast{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50 Fast Pathfind"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Pathfinding Full Duration",
"description": "P95 and P50 of full pathfinding execution time. Sourced from the pathfind_full event (PathRequests.h) which records the duration of the exhaustive pathfinding search. Full pathfinding is more expensive and can take significantly longer than fast mode.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_full{exported_instance=~\"$node\", quantile=\"0.95\"}",
"legendFormat": "P95 Full Pathfind"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_full{exported_instance=~\"$node\", quantile=\"0.5\"}",
"legendFormat": "P50 Full Pathfind"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Resource Warnings Rate",
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_warn_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Warnings / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.1
},
{
"color": "red",
"value": 1
}
]
}
},
"overrides": []
}
},
{
"title": "Resource Drops Rate",
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_drop_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Drops / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.01
},
{
"color": "red",
"value": 0.1
}
]
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by xrpld node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "RPC & Pathfinding (StatsD)",
"uid": "xrpld-statsd-rpc"
}
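
The counter panels in this dashboard (request, warning, and drop rates) each take the per-second rate of a StatsD counter filtered by the `$node` variable. In PromQL the label matcher sits inside the vector selector, ahead of the range. The following is a minimal sketch of a helper that assembles such an expression for a spot check outside Grafana; the function name `build_rate_query` is hypothetical, and the metric names are taken from the panels in this file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): build the PromQL expression
# used by the counter-rate panels.
# Usage: build_rate_query <metric> [node_regex] [window]
build_rate_query() {
    local metric="$1"
    local node_regex="${2:-.*}"
    local window="${3:-5m}"
    # Label matchers go inside the selector braces, before the [range].
    printf 'rate(%s{exported_instance=~"%s"}[%s])\n' "$metric" "$node_regex" "$window"
}

# Example: print the expression for the RPC request-rate panel on one node.
build_rate_query rippled_rpc_requests_total 'Node-1' 5m
```

The printed string could then be sent to the Prometheus HTTP API, for example with `curl -sfG http://localhost:9090/api/v1/query --data-urlencode "query=..."`, assuming Prometheus is published on that port.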


@@ -0,0 +1,552 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Transaction Processing Rate",
"description": "Rate of transactions entering the processing pipeline. tx.process (NetworkOPs.cpp) fires when a transaction is submitted locally or received from a peer and enters processTransaction(). tx.receive (PeerImp.cpp) fires when a raw transaction message arrives from a peer before deduplication.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.process\"}[5m]))",
"legendFormat": "tx.process / Sec [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.receive\"}[5m]))",
"legendFormat": "tx.receive / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Transactions / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Processing Latency",
"description": "p95 and p50 latency of transaction processing (tx.process span). Measures the time from when a transaction enters processTransaction() to completion. Computed via histogram_quantile() over the spanmetrics duration histogram with a 5m rate window.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.process\"}[5m])))",
"legendFormat": "P95 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.process\"}[5m])))",
"legendFormat": "P50 [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Path Distribution",
"description": "Breakdown of transactions by origin path. The local attribute indicates whether the transaction was submitted locally (true) or received from a peer (false). Helps understand the ratio of locally-originated vs relayed transactions.",
"type": "piechart",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (local, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", local=~\"$tx_origin\", span_name=\"tx.process\"}[5m]))",
"legendFormat": "Local = {{local}} [{{exported_instance}}]"
}
]
},
{
"title": "Transaction Receive vs Suppressed",
"description": "Rate of raw transaction messages received from peers (tx.receive span from PeerImp.cpp), split by the suppressed attribute. The span fires before deduplication via the HashRouter, so the difference between tx.receive and tx.process reflects suppressed duplicate transactions.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (suppressed, exported_instance) (rate(traces_span_metrics_calls_total{span_name=\"tx.receive\", exported_instance=~\"$node\"}[$__rate_interval]))",
"legendFormat": "Suppressed={{suppressed}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Transactions / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Processing Duration Heatmap",
"description": "Heatmap showing the distribution of tx.process span durations across histogram buckets over time. Each cell represents the count of transactions that completed within that latency bucket in a 5m window. Reveals whether processing times are consistent or exhibit multi-modal patterns.",
"type": "heatmap",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"yAxis": {
"axisLabel": "Duration (ms)"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.process\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
},
{
"title": "Transaction Apply Duration per Ledger",
"description": "p95 and p50 latency of applying the consensus transaction set to a new ledger. The tx.apply span (BuildLedger.cpp) wraps the applyTransactions() function that iterates through the CanonicalTXSet and applies each transaction to the OpenView. Long durations indicate heavy transaction sets or expensive transaction processing.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P95 tx.apply [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P50 tx.apply [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Peer Transaction Receive Rate",
"description": "Rate of transaction messages received from network peers. Sourced from the tx.receive span (PeerImp.cpp) which fires in the onMessage(TMTransaction) handler. High rates may indicate network-wide transaction volume spikes or peer flooding.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.receive\"}[5m]))",
"legendFormat": "tx.receive / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Transactions / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Apply Failed Rate",
"description": "Rate of tx.apply spans completing with error status, indicating transaction application failures during ledger building. The span records tx_failed as an attribute. Thresholds: green < 0.1/sec, yellow 0.1-1/sec, red > 1/sec. Some failures are normal (e.g. conflicting offers) but sustained high rates may indicate issues.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.apply\", status_code=\"STATUS_CODE_ERROR\"}[5m]))",
"legendFormat": "Failed / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.1
},
{
"color": "red",
"value": 1
}
]
}
},
"overrides": []
}
},
{
"title": "Transaction Rate by Type",
"description": "Transaction processing rate broken down by tx_type (Payment, OfferCreate, AMMDeposit, etc.). Requires tx_type dimension in spanmetrics.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (tx_type) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.process\", tx_type!=\"\"}[5m]))",
"legendFormat": "{{tx_type}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "TX / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Results by Type",
"description": "Transaction result codes (ter_result) broken down by tx_type. Shows which transaction types fail most often.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 32
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (tx_type, ter_result) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"tx.process\", ter_result!=\"\", ter_result!=\"tesSUCCESS\"}[5m]))",
"legendFormat": "{{tx_type}}: {{ter_result}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Failed TX / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "TxQ Accept Status",
"description": "TxQ accept outcomes: applied (included in ledger), failed (removed), retried (kept for next round).",
"type": "piechart",
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 40
},
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
},
"tooltip": {
"mode": "multi"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (txq_status) (increase(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=\"txq.accept_tx\", txq_status!=\"\"}[5m]))",
"legendFormat": "{{txq_status}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short"
},
"overrides": []
}
},
{
"title": "Transactor Duration by Type (p95)",
"description": "Per-transactor execution time (tx.transactor span). Shows which transaction types are most expensive to execute.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 16,
"x": 8,
"y": 40
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, tx_type) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=\"tx.transactor\", tx_type!=\"\"}[5m])))",
"legendFormat": "p95 {{tx_type}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "transactions", "telemetry"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "tx_origin",
"label": "TX Origin",
"description": "Filter by transaction origin (true = local submit, false = peer relay)",
"type": "query",
"query": "label_values(traces_span_metrics_calls_total{span_name=\"tx.process\"}, local)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Transaction Overview",
"uid": "xrpld-transactions"
}
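
The latency panels in this dashboard all follow one pattern: `histogram_quantile()` over the per-bucket rate of the spanmetrics duration histogram, aggregated with `sum by (le, ...)`. The `le` label has to survive the aggregation because it carries the bucket boundaries the quantile is interpolated from. A minimal sketch that reproduces the expression for an arbitrary span name; the helper name `build_span_quantile_query` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): build a latency-quantile query
# over the spanmetrics duration histogram.
# Usage: build_span_quantile_query <quantile> <span_name> [node_regex] [window]
build_span_quantile_query() {
    local quantile="$1"
    local span_name="$2"
    local node_regex="${3:-.*}"
    local window="${4:-5m}"
    # Keep "le" in the grouping so histogram_quantile can see the buckets.
    printf 'histogram_quantile(%s, sum by (le, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~"%s", span_name="%s"}[%s])))\n' \
        "$quantile" "$node_regex" "$span_name" "$window"
}

# Example: the P95 expression behind the "Transaction Processing Latency" panel.
build_span_quantile_query 0.95 tx.process
```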


@@ -0,0 +1,12 @@
apiVersion: 1
providers:
- name: xrpld-telemetry
orgId: 1
folder: xrpld
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: false


@@ -0,0 +1,10 @@
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
uid: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
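
Dashboards backed by this datasource stay empty until Prometheus is reachable at the provisioned URL. A minimal sketch of a generic retry helper that could gate on that; the name `wait_for` is hypothetical, and the commented example assumes Prometheus is published on `localhost:9090` and uses its `/-/ready` endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): retry a command until it
# succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local n
    for n in $(seq 1 "$attempts"); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (assumption: Prometheus is reachable from the host on port 9090):
# wait_for 30 1 curl -sf http://localhost:9090/-/ready
```

The helper returns 0 as soon as the command succeeds and 1 once every attempt has failed, so it composes with `||` in a script that runs under `set -e`.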


@@ -40,9 +40,9 @@ datasources:
operator: "="
scope: resource
type: static
# service.instance.id: unique node identifier — defaults to the
# node's public key (e.g., nHB1X37...). Distinguishes individual
# nodes in a multi-node cluster or network.
# service.instance.id: unique node identifier — configurable via
# the service_instance_id setting in [telemetry], defaults to the
# node's public key. E.g. "Node-1" or "nHB1X37...".
- id: node-id
tag: service.instance.id
operator: "="
@@ -155,3 +155,54 @@ datasources:
operator: "="
scope: span
type: dynamic
- id: consensus-proposers
tag: proposers
operator: "="
scope: span
type: dynamic
- id: consensus-result
tag: consensus_result
operator: "="
scope: span
type: dynamic
- id: consensus-mode-old
tag: mode_old
operator: "="
scope: span
type: dynamic
- id: consensus-mode-new
tag: mode_new
operator: "="
scope: span
type: dynamic
- id: consensus-ledger-id
tag: xrpl.consensus.ledger_id
operator: "="
scope: span
type: static
# Phase 3/4: Additional transaction and queue filters
- id: tx-path
tag: path
operator: "="
scope: span
type: dynamic
- id: tx-suppressed
tag: suppressed
operator: "="
scope: span
type: dynamic
- id: peer-version
tag: peer_version
operator: "="
scope: span
type: dynamic
- id: txq-status
tag: txq_status
operator: "="
scope: span
type: dynamic
- id: txq-ter-code
tag: ter_code
operator: "="
scope: span
type: dynamic
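
Each filter above maps a Grafana search field to a span or resource attribute, which corresponds to an attribute condition in a TraceQL query. A minimal sketch that builds such a query for one span name and one optional attribute; the helper name `build_traceql` is hypothetical, and it assumes the attribute is string-valued (as `txq_status` appears to be from the dashboard descriptions).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): build a TraceQL query for one
# span name, optionally narrowed by a single string-valued span attribute.
# Usage: build_traceql <service> <span_name> [attr] [value]
build_traceql() {
    local service="$1"
    local span_name="$2"
    local attr="${3:-}"
    local value="${4:-}"
    local query
    query="resource.service.name=\"$service\" && name=\"$span_name\""
    if [ -n "$attr" ]; then
        # Assumption: the attribute is span-scoped and holds a string.
        query="$query && span.$attr=\"$value\""
    fi
    printf '{%s}\n' "$query"
}

# Example: traces containing a txq.accept_tx span with txq_status "applied".
build_traceql rippled txq.accept_tx txq_status applied
```

The result could be passed to Tempo's search endpoint with `curl -sfG http://localhost:3200/api/search --data-urlencode "q=..."`, assuming Tempo is published on that port.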


@@ -0,0 +1,612 @@
#!/usr/bin/env bash
# Integration test for rippled OpenTelemetry instrumentation.
#
# Launches a 6-node xrpld consensus network with telemetry enabled,
# exercises RPC / transaction / consensus code paths, then verifies
# that the expected spans and metrics appear in Tempo and Prometheus.
#
# Usage:
# bash docker/telemetry/integration-test.sh
#
# Prerequisites:
# - .build/xrpld built with telemetry=ON
# - docker compose (v2)
# - curl, jq
#
# The script leaves the observability stack and xrpld nodes running
# so you can manually inspect Tempo (localhost:3200) and Grafana
# (localhost:3000). Run with --cleanup to tear down instead.
set -euo pipefail
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
XRPLD="$REPO_ROOT/.build/xrpld"
COMPOSE_FILE="$SCRIPT_DIR/docker-compose.yml"
STANDALONE_CFG="$SCRIPT_DIR/xrpld-telemetry.cfg"
WORKDIR="/tmp/xrpld-integration"
NUM_NODES=6
PEER_PORT_BASE=51235
RPC_PORT_BASE=5005
CONSENSUS_TIMEOUT=120
GENESIS_ACCOUNT="rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"
GENESIS_SEED="snoPBrXtMeMyMHUVTgbuqAfg1SUTb"
DEST_ACCOUNT="" # Generated dynamically via wallet_propose
TEMPO="http://localhost:3200"
PROM="http://localhost:9090"
# Counters for pass/fail
PASS=0
FAIL=0
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
log() { printf "\033[1;34m[INFO]\033[0m %s\n" "$*"; }
ok() {
printf "\033[1;32m[PASS]\033[0m %s\n" "$*"
PASS=$((PASS + 1))
}
fail() {
printf "\033[1;31m[FAIL]\033[0m %s\n" "$*"
FAIL=$((FAIL + 1))
}
die() {
printf "\033[1;31m[ERROR]\033[0m %s\n" "$*" >&2
exit 1
}
check_span() {
local op="$1"
local count
count=$(curl -sfG "$TEMPO/api/search" \
--data-urlencode "q={resource.service.name=\"rippled\" && name=\"$op\"}" \
--data-urlencode "limit=5" |
jq '.traces | length' 2>/dev/null || echo 0)
if [ "$count" -gt 0 ]; then
ok "$op ($count traces)"
else
fail "$op (0 traces)"
fi
}
cleanup() {
log "Cleaning up..."
# Kill xrpld nodes
for i in $(seq 1 "$NUM_NODES"); do
local pidfile="$WORKDIR/node$i/xrpld.pid"
if [ -f "$pidfile" ]; then
kill "$(cat "$pidfile")" 2>/dev/null || true
rm -f "$pidfile"
fi
done
# Also kill any straggling xrpld processes from our workdir
pkill -f "$WORKDIR" 2>/dev/null || true
# Stop docker stack
docker compose -f "$COMPOSE_FILE" down 2>/dev/null || true
# Remove workdir
rm -rf "$WORKDIR"
log "Cleanup complete."
}
# Handle --cleanup flag
if [ "${1:-}" = "--cleanup" ]; then
cleanup
exit 0
fi
# ---------------------------------------------------------------------------
# Step 0: Prerequisites
# ---------------------------------------------------------------------------
log "Checking prerequisites..."
command -v docker >/dev/null 2>&1 || die "docker not found"
docker compose version >/dev/null 2>&1 || die "docker compose (v2) not found"
command -v curl >/dev/null 2>&1 || die "curl not found"
command -v jq >/dev/null 2>&1 || die "jq not found"
[ -x "$XRPLD" ] || die "xrpld binary not found at $XRPLD (build with telemetry=ON)"
[ -f "$COMPOSE_FILE" ] || die "docker-compose.yml not found at $COMPOSE_FILE"
[ -f "$STANDALONE_CFG" ] || die "xrpld-telemetry.cfg not found at $STANDALONE_CFG"
log "All prerequisites met."
# ---------------------------------------------------------------------------
# Step 1: Clean previous run
# ---------------------------------------------------------------------------
log "Cleaning previous run data..."
for i in $(seq 1 "$NUM_NODES"); do
pidfile="$WORKDIR/node$i/xrpld.pid"
if [ -f "$pidfile" ]; then
kill "$(cat "$pidfile")" 2>/dev/null || true
fi
done
pkill -f "$WORKDIR" 2>/dev/null || true
# Kill any xrpld using the standalone config (from key generation)
pkill -f "xrpld-telemetry.cfg" 2>/dev/null || true
sleep 2
rm -rf "$WORKDIR"
mkdir -p "$WORKDIR"
# ---------------------------------------------------------------------------
# Step 2: Start observability stack
# ---------------------------------------------------------------------------
log "Starting observability stack..."
docker compose -f "$COMPOSE_FILE" up -d
log "Waiting for otel-collector to be ready..."
for attempt in $(seq 1 30); do
# The OTLP HTTP endpoint returns 405 for GET (expects POST), which
# means it is listening. curl -sf would fail on 405, so we check
# the HTTP status code explicitly.
# curl prints 000 via -w when the connection fails, so the fallback
# must not print a second value.
status=$(curl -so /dev/null -w '%{http_code}' http://localhost:4318/ 2>/dev/null || true)
if [ "$status" != "000" ]; then
log "otel-collector ready (attempt $attempt, HTTP $status)."
break
fi
if [ "$attempt" -eq 30 ]; then
die "otel-collector not ready after 30s"
fi
sleep 1
done
log "Waiting for Tempo to be ready..."
for attempt in $(seq 1 30); do
if curl -sf "$TEMPO/ready" >/dev/null 2>&1; then
log "Tempo ready (attempt $attempt)."
break
fi
if [ "$attempt" -eq 30 ]; then
die "Tempo not ready after 30s"
fi
sleep 1
done
# ---------------------------------------------------------------------------
# Step 3: Generate validator keys
# ---------------------------------------------------------------------------
log "Generating $NUM_NODES validator key pairs..."
# Start a temporary standalone xrpld for key generation
TEMP_DATA="$WORKDIR/temp-keygen"
mkdir -p "$TEMP_DATA"
# Create a minimal temp config for key generation
TEMP_CFG="$TEMP_DATA/xrpld.cfg"
cat >"$TEMP_CFG" <<EOCFG
[server]
port_rpc_temp
[port_rpc_temp]
port = 5099
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http
[node_db]
type=NuDB
path=$TEMP_DATA/nudb
online_delete=256
[database_path]
$TEMP_DATA/db
[debug_logfile]
$TEMP_DATA/debug.log
[ssl_verify]
0
EOCFG
"$XRPLD" --conf "$TEMP_CFG" -a --start >"$TEMP_DATA/stdout.log" 2>&1 &
TEMP_PID=$!
log "Temporary xrpld started (PID $TEMP_PID), waiting for RPC..."
for attempt in $(seq 1 30); do
if curl -sf http://localhost:5099 -d '{"method":"server_info"}' >/dev/null 2>&1; then
log "Temporary xrpld RPC ready (attempt $attempt)."
break
fi
if [ "$attempt" -eq 30 ]; then
kill "$TEMP_PID" 2>/dev/null || true
die "Temporary xrpld RPC not ready after 30s"
fi
sleep 1
done
declare -a SEEDS
declare -a PUBKEYS
for i in $(seq 1 "$NUM_NODES"); do
result=$(curl -sf http://localhost:5099 -d '{"method":"validation_create"}')
seed=$(echo "$result" | jq -r '.result.validation_seed')
pubkey=$(echo "$result" | jq -r '.result.validation_public_key')
if [ -z "$seed" ] || [ "$seed" = "null" ]; then
kill "$TEMP_PID" 2>/dev/null || true
die "Failed to generate key pair $i"
fi
SEEDS+=("$seed")
PUBKEYS+=("$pubkey")
log " Node $i: $pubkey"
done
kill "$TEMP_PID" 2>/dev/null || true
wait "$TEMP_PID" 2>/dev/null || true
rm -rf "$TEMP_DATA"
log "Key generation complete."
# ---------------------------------------------------------------------------
# Step 4: Generate node configs and validators.txt
# ---------------------------------------------------------------------------
log "Generating node configs..."
# Create shared validators.txt
VALIDATORS_FILE="$WORKDIR/validators.txt"
{
echo "[validators]"
for i in $(seq 0 $((NUM_NODES - 1))); do
echo "${PUBKEYS[$i]}"
done
} >"$VALIDATORS_FILE"
# Create per-node configs
for i in $(seq 1 "$NUM_NODES"); do
NODE_DIR="$WORKDIR/node$i"
mkdir -p "$NODE_DIR/nudb" "$NODE_DIR/db"
RPC_PORT=$((RPC_PORT_BASE + i - 1))
PEER_PORT=$((PEER_PORT_BASE + i - 1))
SEED="${SEEDS[$((i - 1))]}"
# Build ips_fixed list (all peers except self)
IPS_FIXED=""
for j in $(seq 1 "$NUM_NODES"); do
if [ "$j" -ne "$i" ]; then
IPS_FIXED="${IPS_FIXED}127.0.0.1 $((PEER_PORT_BASE + j - 1))
"
fi
done
cat >"$NODE_DIR/xrpld.cfg" <<EOCFG
[server]
port_rpc
port_peer
[port_rpc]
port = $RPC_PORT
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http
[port_peer]
port = $PEER_PORT
ip = 0.0.0.0
protocol = peer
[node_db]
type=NuDB
path=$NODE_DIR/nudb
online_delete=256
[database_path]
$NODE_DIR/db
[debug_logfile]
$NODE_DIR/debug.log
[validation_seed]
$SEED
[validators_file]
$VALIDATORS_FILE
[ips_fixed]
${IPS_FIXED}
[peer_private]
1
[telemetry]
enabled=1
service_instance_id=Node-${i}
endpoint=http://localhost:4318/v1/traces
exporter=otlp_http
sampling_ratio=1.0
batch_size=512
batch_delay_ms=2000
max_queue_size=2048
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=1
trace_ledger=1
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
[rpc_startup]
{ "command": "log_level", "severity": "warning" }
[ssl_verify]
0
EOCFG
log " Node $i config: RPC=$RPC_PORT, Peer=$PEER_PORT"
done
# ---------------------------------------------------------------------------
# Step 5: Start all 6 nodes
# ---------------------------------------------------------------------------
log "Starting $NUM_NODES xrpld nodes..."
for i in $(seq 1 "$NUM_NODES"); do
NODE_DIR="$WORKDIR/node$i"
"$XRPLD" --conf "$NODE_DIR/xrpld.cfg" --start >"$NODE_DIR/stdout.log" 2>&1 &
echo $! >"$NODE_DIR/xrpld.pid"
log " Node $i started (PID $(cat "$NODE_DIR/xrpld.pid"))"
done
# Give nodes a moment to initialize
sleep 5
# ---------------------------------------------------------------------------
# Step 6: Wait for consensus
# ---------------------------------------------------------------------------
log "Waiting for nodes to reach 'proposing' state (timeout: ${CONSENSUS_TIMEOUT}s)..."
start_time=$(date +%s)
nodes_ready=0
while [ "$nodes_ready" -lt "$NUM_NODES" ]; do
elapsed=$(($(date +%s) - start_time))
if [ "$elapsed" -ge "$CONSENSUS_TIMEOUT" ]; then
fail "Consensus timeout after ${CONSENSUS_TIMEOUT}s ($nodes_ready/$NUM_NODES nodes ready)"
log "Continuing with partial consensus..."
break
fi
nodes_ready=0
for i in $(seq 1 "$NUM_NODES"); do
RPC_PORT=$((RPC_PORT_BASE + i - 1))
state=$(curl -sf "http://localhost:$RPC_PORT" \
-d '{"method":"server_info"}' 2>/dev/null |
jq -r '.result.info.server_state' 2>/dev/null || echo "unreachable")
if [ "$state" = "proposing" ]; then
nodes_ready=$((nodes_ready + 1))
fi
done
printf "\r %d/%d nodes proposing (%ds elapsed)..." "$nodes_ready" "$NUM_NODES" "$elapsed"
if [ "$nodes_ready" -lt "$NUM_NODES" ]; then
sleep 3
fi
done
echo ""
if [ "$nodes_ready" -eq "$NUM_NODES" ]; then
ok "All $NUM_NODES nodes reached 'proposing' state"
else
fail "Only $nodes_ready/$NUM_NODES nodes reached 'proposing' state"
fi
# ---------------------------------------------------------------------------
# Step 6b: Wait for validated ledger
# ---------------------------------------------------------------------------
log "Waiting for first validated ledger..."
for attempt in $(seq 1 60); do
val_seq=$(curl -sf "http://localhost:$RPC_PORT_BASE" \
-d '{"method":"server_info"}' 2>/dev/null |
jq -r '.result.info.validated_ledger.seq // 0' 2>/dev/null || echo 0)
if [ "$val_seq" -gt 2 ] 2>/dev/null; then
ok "First validated ledger: seq $val_seq"
break
fi
if [ "$attempt" -eq 60 ]; then
fail "No validated ledger after 60s"
fi
sleep 1
done
# ---------------------------------------------------------------------------
# Step 7: Exercise RPC spans (Phase 2)
# ---------------------------------------------------------------------------
log "Exercising RPC spans..."
curl -sf "http://localhost:$RPC_PORT_BASE" \
-d '{"method":"server_info"}' >/dev/null
curl -sf "http://localhost:$RPC_PORT_BASE" \
-d '{"method":"server_state"}' >/dev/null
curl -sf "http://localhost:$RPC_PORT_BASE" \
-d '{"method":"ledger","params":[{"ledger_index":"current"}]}' >/dev/null
log "RPC commands sent. Waiting 5s for batch export..."
sleep 5
# ---------------------------------------------------------------------------
# Step 8: Submit transaction (Phase 3)
# ---------------------------------------------------------------------------
log "Submitting Payment transaction..."
# Generate a destination wallet
log " Generating destination wallet..."
wallet_result=$(curl -sf "http://localhost:$RPC_PORT_BASE" \
-d '{"method":"wallet_propose"}')
DEST_ACCOUNT=$(echo "$wallet_result" | jq -r '.result.account_id' 2>/dev/null)
if [ -z "$DEST_ACCOUNT" ] || [ "$DEST_ACCOUNT" = "null" ]; then
fail "Could not generate destination wallet"
DEST_ACCOUNT="rrrrrrrrrrrrrrrrrrrrrhoLvTp" # ACCOUNT_ZERO fallback
fi
log " Destination: $DEST_ACCOUNT"
# Get genesis account info
acct_result=$(curl -sf "http://localhost:$RPC_PORT_BASE" \
-d "{\"method\":\"account_info\",\"params\":[{\"account\":\"$GENESIS_ACCOUNT\"}]}")
seq_num=$(echo "$acct_result" | jq -r '.result.account_data.Sequence' 2>/dev/null || echo "unknown")
log " Genesis account sequence: $seq_num"
# Submit payment
submit_result=$(curl -sf "http://localhost:$RPC_PORT_BASE" \
-d "{\"method\":\"submit\",\"params\":[{\"secret\":\"$GENESIS_SEED\",\"tx_json\":{\"TransactionType\":\"Payment\",\"Account\":\"$GENESIS_ACCOUNT\",\"Destination\":\"$DEST_ACCOUNT\",\"Amount\":\"10000000\"}}]}")
engine_result=$(echo "$submit_result" | jq -r '.result.engine_result' 2>/dev/null || echo "unknown")
tx_hash=$(echo "$submit_result" | jq -r '.result.tx_json.hash' 2>/dev/null || echo "unknown")
if [ "$engine_result" = "tesSUCCESS" ] || [ "$engine_result" = "terQUEUED" ]; then
ok "Transaction submitted: $engine_result (hash: ${tx_hash:0:16}...)"
else
fail "Transaction submission: $engine_result"
log " Full response: $(echo "$submit_result" | jq -c .result 2>/dev/null)"
fi
log "Waiting 15s for consensus round + batch export..."
sleep 15
# ---------------------------------------------------------------------------
# Step 9: Verify Tempo traces
# ---------------------------------------------------------------------------
log "Verifying spans in Tempo..."
# Check service registration
services=$(curl -sf "$TEMPO/api/v2/search/tag/resource.service.name/values" |
jq -r '.tagValues[].value' 2>/dev/null || echo "")
if echo "$services" | grep -q "rippled"; then
ok "Service 'rippled' registered in Tempo"
else
fail "Service 'rippled' NOT found in Tempo (found: $services)"
fi
log ""
log "--- Phase 2: RPC Spans ---"
check_span "rpc.request"
check_span "rpc.process"
check_span "rpc.command.server_info"
check_span "rpc.command.server_state"
check_span "rpc.command.ledger"
log ""
log "--- Phase 3: Transaction Spans ---"
check_span "tx.process"
check_span "tx.receive"
check_span "tx.apply"
log ""
log "--- Phase 4: Consensus Spans ---"
check_span "consensus.proposal.send"
check_span "consensus.ledger_close"
check_span "consensus.accept"
check_span "consensus.validation.send"
log ""
log "--- Phase 5: Ledger Spans ---"
check_span "ledger.build"
check_span "ledger.validate"
check_span "ledger.store"
log ""
log "--- Phase 5: Peer Spans (trace_peer=1) ---"
check_span "peer.proposal.receive"
check_span "peer.validation.receive"
# ---------------------------------------------------------------------------
# Step 10: Verify Prometheus spanmetrics
# ---------------------------------------------------------------------------
log ""
log "--- Phase 5: Spanmetrics ---"
log "Waiting 20s for Prometheus scrape cycle..."
sleep 20
calls_count=$(curl -sf "$PROM/api/v1/query?query=traces_span_metrics_calls_total" |
jq '.data.result | length' 2>/dev/null || echo 0)
if [ "${calls_count:-0}" -gt 0 ]; then
ok "Prometheus: traces_span_metrics_calls_total ($calls_count series)"
else
fail "Prometheus: traces_span_metrics_calls_total (0 series)"
fi
duration_count=$(curl -sf "$PROM/api/v1/query?query=traces_span_metrics_duration_milliseconds_count" |
jq '.data.result | length' 2>/dev/null || echo 0)
if [ "${duration_count:-0}" -gt 0 ]; then
ok "Prometheus: duration histogram ($duration_count series)"
else
fail "Prometheus: duration histogram (0 series)"
fi
# Check Grafana
if curl -sf http://localhost:3000/api/health >/dev/null 2>&1; then
ok "Grafana: healthy at localhost:3000"
else
fail "Grafana: not reachable at localhost:3000"
fi
# ---------------------------------------------------------------------------
# Step 10b: Verify StatsD metrics in Prometheus
# ---------------------------------------------------------------------------
log ""
log "--- Phase 6: StatsD Metrics (beast::insight) ---"
log "Waiting 20s for StatsD aggregation + Prometheus scrape..."
sleep 20
check_statsd_metric() {
local metric_name="$1"
local result
result=$(curl -sf "$PROM/api/v1/query?query=$metric_name" |
jq '.data.result | length' 2>/dev/null || echo 0)
if [ "${result:-0}" -gt 0 ]; then
ok "StatsD: $metric_name ($result series)"
else
fail "StatsD: $metric_name (0 series)"
fi
}
# Node health gauges
check_statsd_metric "rippled_LedgerMaster_Validated_Ledger_Age"
check_statsd_metric "rippled_LedgerMaster_Published_Ledger_Age"
check_statsd_metric "rippled_job_count"
# State accounting
check_statsd_metric "rippled_State_Accounting_Full_duration"
# Peer finder
check_statsd_metric "rippled_Peer_Finder_Active_Inbound_Peers"
check_statsd_metric "rippled_Peer_Finder_Active_Outbound_Peers"
# RPC counters (only if RPC was exercised; should be true from Steps 6-8)
check_statsd_metric "rippled_rpc_requests"
# Overlay traffic
check_statsd_metric "rippled_total_Bytes_In"
# ---------------------------------------------------------------------------
# Step 11: Summary
# ---------------------------------------------------------------------------
echo ""
echo "==========================================================="
echo " INTEGRATION TEST RESULTS"
echo "==========================================================="
printf " \033[1;32mPASSED: %d\033[0m\n" "$PASS"
printf " \033[1;31mFAILED: %d\033[0m\n" "$FAIL"
echo "==========================================================="
echo ""
echo " Observability stack is running:"
echo ""
echo " Tempo: http://localhost:3200"
echo " Grafana: http://localhost:3000"
echo " Prometheus: http://localhost:9090"
echo ""
echo " xrpld nodes ($NUM_NODES) are running:"
for i in $(seq 1 "$NUM_NODES"); do
RPC_PORT=$((RPC_PORT_BASE + i - 1))
PEER_PORT=$((PEER_PORT_BASE + i - 1))
echo " Node $i: RPC=localhost:$RPC_PORT Peer=:$PEER_PORT PID=$(cat "$WORKDIR/node$i/xrpld.pid" 2>/dev/null || echo 'unknown')"
done
echo ""
echo " To tear down:"
echo " bash docker/telemetry/integration-test.sh --cleanup"
echo ""
echo "==========================================================="
if [ "$FAIL" -gt 0 ]; then
exit 1
fi


@@ -1,9 +1,16 @@
# OpenTelemetry Collector configuration for xrpld development.
#
# Pipeline: OTLP receiver -> batch processor -> debug + Tempo.
# Pipelines:
# traces: OTLP receiver -> batch processor -> debug + Tempo + spanmetrics
# metrics: StatsD receiver + spanmetrics connector -> Prometheus exporter
#
# xrpld sends traces via OTLP/HTTP to port 4318. The collector batches
# them and forwards to Tempo via OTLP/gRPC on the Docker network. Tempo
# is queryable via Grafana Explore using TraceQL.
# them, forwards to Tempo, and derives RED metrics via the spanmetrics
# connector, which Prometheus scrapes on port 8889.
#
# xrpld's beast::insight framework sends StatsD UDP metrics to port 8125.
# The StatsD receiver aggregates them and exports to Prometheus alongside
# the span-derived metrics.
receivers:
otlp:
@@ -12,12 +19,51 @@ receivers:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
statsd:
endpoint: "0.0.0.0:8125"
aggregation_interval: 15s
enable_metric_type: true
is_monotonic_counter: true
timer_histogram_mapping:
- statsd_type: "timing"
observer_type: "summary"
summary:
percentiles: [0, 50, 90, 95, 99, 100]
- statsd_type: "histogram"
observer_type: "summary"
summary:
percentiles: [0, 50, 90, 95, 99, 100]
processors:
batch:
timeout: 1s
send_batch_size: 100
connectors:
spanmetrics:
# Expose service.instance.id (node public key) as a Prometheus label so
# Grafana dashboards can filter metrics by individual node.
resource_metrics_key_attributes:
- service.instance.id
histogram:
explicit:
buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
dimensions:
- name: command
- name: rpc_status
- name: xrpl.consensus.mode
- name: close_time_correct
- name: local
- name: suppressed
- name: proposal_trusted
- name: validation_trusted
- name: tx_type
- name: ter_result
- name: txq_status
- name: consensus_state
- name: load_type
- name: is_batch
exporters:
debug:
verbosity: detailed
@@ -25,6 +71,8 @@ exporters:
endpoint: tempo:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
extensions:
health_check:
@@ -36,4 +84,7 @@ service:
traces:
receivers: [otlp]
processors: [batch]
exporters: [debug, otlp/tempo]
exporters: [debug, otlp/tempo, spanmetrics]
metrics:
receivers: [spanmetrics, statsd]
exporters: [prometheus]


@@ -0,0 +1,9 @@
# Prometheus configuration for scraping spanmetrics and StatsD-derived metrics from the OTel Collector.
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: otel-collector
static_configs:
- targets: ["otel-collector:8889"]


@@ -0,0 +1,60 @@
# Standalone xrpld configuration with OpenTelemetry enabled.
#
# Usage:
# 1. Start the observability stack:
# docker compose -f docker/telemetry/docker-compose.yml up -d
# 2. Run xrpld in standalone mode:
# ./xrpld --conf docker/telemetry/xrpld-telemetry.cfg -a --start
# 3. Send RPC commands to exercise tracing:
# curl -s http://localhost:5005 -d '{"method":"server_info"}'
# 4. View traces in Grafana Explore (Tempo data source): http://localhost:3000
[server]
port_rpc_admin_local
port_ws_admin_local
[port_rpc_admin_local]
port = 5005
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http
[port_ws_admin_local]
port = 6006
ip = 127.0.0.1
admin = 127.0.0.1
protocol = ws
[node_db]
type=NuDB
path=docker/telemetry/data/nudb
online_delete=256
advisory_delete=0
[database_path]
docker/telemetry/data
[debug_logfile]
docker/telemetry/data/debug.log
[rpc_startup]
{ "command": "log_level", "severity": "debug" }
[ssl_verify]
0
# --- OpenTelemetry tracing ---
[telemetry]
enabled=1
service_instance_id=xrpld-standalone
endpoint=http://localhost:4318/v1/traces
exporter=otlp_http
sampling_ratio=1.0
batch_size=512
batch_delay_ms=5000
max_queue_size=2048
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=0
trace_ledger=1

docs/telemetry-runbook.md

@@ -0,0 +1,659 @@
# xrpld Telemetry Operator Runbook
## Overview
xrpld supports OpenTelemetry distributed tracing to provide visibility into RPC requests, transaction processing, and consensus rounds.
## Quick Start
### 1. Start the observability stack
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
This starts:
- **OTel Collector** on ports 4317 (gRPC) and 4318 (HTTP)
- **Tempo** on http://localhost:3200 (trace backend; browse traces in Grafana Explore)
- **Prometheus** on http://localhost:9090
- **Grafana** on http://localhost:3000
### 2. Enable telemetry in xrpld
Add to your `xrpld.cfg`:
```ini
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```
### 3. Build with telemetry support
```bash
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default
```
## Configuration Reference
| Option | Default | Description |
| -------------------------- | --------------------------------- | --------------------------------------------------------- |
| `enabled` | `0` | Master switch for telemetry |
| `endpoint` | `http://localhost:4318/v1/traces` | OTLP/HTTP endpoint |
| `service_name` | `xrpld` | OpenTelemetry service name resource attribute |
| `service_instance_id` | node public key | OpenTelemetry service instance ID resource attribute |
| `sampling_ratio` | `1.0` | Head-based sampling ratio (0.0--1.0) |
| `trace_rpc` | `1` | Enable RPC request tracing |
| `trace_transactions` | `1` | Enable transaction tracing |
| `trace_consensus` | `1` | Enable consensus tracing |
| `trace_peer` | `0` | Enable peer message tracing (high volume) |
| `trace_ledger` | `1` | Enable ledger tracing |
| `consensus_trace_strategy` | `deterministic` | Consensus trace ID strategy (`deterministic` or `random`) |
| `batch_size` | `512` | Max spans per batch export |
| `batch_delay_ms` | `5000` | Delay between batch exports |
| `max_queue_size` | `2048` | Max spans queued before dropping |
| `use_tls` | `0` | Use TLS for exporter connection |
| `tls_ca_cert` | (empty) | Path to CA certificate bundle |
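As a rough illustration of how the options above combine, the sketch below appends a `[telemetry]` stanza to a config file. The helper name `write_telemetry_stanza` and its arguments are hypothetical and not part of xrpld; only the option names and the endpoint default come from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of xrpld): append a [telemetry] stanza to a
# config file, using only option names listed in the Configuration Reference.
write_telemetry_stanza() {
    local cfg="$1"             # path to the xrpld config file
    local ratio="${2:-1.0}"    # head-based sampling ratio, 0.0 to 1.0
    cat >>"$cfg" <<EOF

[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
sampling_ratio=$ratio
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=0
EOF
}
```

For example, `write_telemetry_stanza ./xrpld.cfg 0.1` would request sampling of roughly 10% of traces while leaving the high-volume peer tracing off.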
## Span Reference
All spans instrumented in xrpld, grouped by subsystem:
### RPC Spans (Phase 2)
| Span Name | Source File | Attributes | Description |
| -------------------- | ----------------- | ----------------------------------------------------------- | ----------------------------------------------------- |
| `rpc.http_request` | ServerHandler.cpp | `request_payload_size` | Top-level HTTP RPC request |
| `rpc.ws_upgrade` | ServerHandler.cpp | -- | WebSocket upgrade handshake |
| `rpc.ws_message` | ServerHandler.cpp | `command` | WebSocket RPC message |
| `rpc.process` | ServerHandler.cpp | `is_batch`, `batch_size` | RPC processing (child of rpc.http_request/ws_message) |
| `rpc.command.<name>` | RPCHandler.cpp | `command`, `version`, `rpc_role`, `rpc_status`, `load_type` | Per-command span (e.g., `rpc.command.server_info`) |
### Transaction Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------ | --------------- | -------------------------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp | `xrpl.tx.hash`, `local`, `path`, `tx_type`, `fee`, `sequence`, `ter_result`, `applied` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp | `xrpl.peer.id`, `xrpl.tx.hash`, `tx_type`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
### Transaction Queue Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------------ | ----------- | ------------------------------------------------------------- | -------------------------------------------------- |
| `txq.enqueue` | TxQ.cpp | `xrpl.tx.hash`, `tx_type` | Transaction enqueue decision (child of tx.process) |
| `txq.apply_direct` | TxQ.cpp | -- | Direct apply attempt (bypassing queue) |
| `txq.batch_clear` | TxQ.cpp | -- | Batch clear of queued transactions for an account |
| `txq.accept` | TxQ.cpp | `queue_size`, `ledger_changed` | Ledger-close accept loop over queued transactions |
| `txq.accept_tx` | TxQ.cpp | `xrpl.tx.hash`, `retries_remaining`, `ter_code`, `txq_status` | Per-transaction apply during accept |
| `txq.cleanup` | TxQ.cpp | `xrpl.ledger.seq` | Post-close cleanup of expired queue entries |
### Consensus Spans (Phase 4)
| Span Name | Source File | Attributes | Description |
| ------------------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `consensus.round` | RCLConsensus.cpp | `xrpl.consensus.ledger_id`, `xrpl.ledger.seq`, `xrpl.consensus.mode`, `trace_strategy`, `xrpl.consensus.round_id` | Root span for a consensus round (deterministic or random trace ID) |
| `consensus.phase.open` | Consensus.h | -- | Open phase duration (child of round) |
| `consensus.proposal.send` | RCLConsensus.cpp | `xrpl.consensus.round`, `is_bow_out` | Consensus proposal broadcast |
| `consensus.ledger_close` | RCLConsensus.cpp | `xrpl.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.establish` | Consensus.h | `converge_percent`, `establish_count`, `proposers` | Establish phase duration (child of round) |
| `consensus.update_positions` | Consensus.h | `converge_percent`, `proposers`, `disputes_count` | Position update and dispute resolution (see Events below) |
| `consensus.check` | Consensus.h | `agree_count`, `disagree_count`, `converge_percent`, `have_close_time_consensus`, `threshold_percent`, `consensus_result` | Consensus threshold check |
| `consensus.accept` | RCLConsensus.cpp | `proposers`, `round_time_ms`, `quorum`, `disputes_count`, `consensus_state` | Ledger accepted by consensus |
| `consensus.accept.apply` | RCLConsensus.cpp | `xrpl.ledger.seq`, `close_time`, `close_time_correct`, `close_resolution_ms`, `consensus_state`, `proposing`, `round_time_ms`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction`, `tx_count` | Ledger application with close time details (see Events below) |
| `consensus.validation.send` | RCLConsensus.cpp | `xrpl.ledger.seq`, `proposing`, `ledger_hash`, `full_validation`, `validation_sign_time` | Validation sent after accept (follows-from link) |
| `consensus.mode_change` | RCLConsensus.cpp | `mode_old`, `mode_new` | Consensus mode transition |
| `consensus.proposal.receive` | PeerImp.cpp | `trusted`, `xrpl.consensus.round` | Proposal received from peer (extracts parent context from TraceContext when present; falls back to standalone span for older peers) |
| `consensus.validation.receive` | PeerImp.cpp | `trusted`, `xrpl.ledger.seq` | Validation received from peer (extracts parent context from TraceContext when present; falls back to standalone span for older peers) |
#### Consensus Span Events
| Parent Span | Event Name | Event Attributes | Description |
| ---------------------------- | ----------------- | ---------------------------------------------------------------- | ------------------------------------------------------- |
| `consensus.update_positions` | `dispute.resolve` | `xrpl.tx.id`, `dispute_our_vote`, `dispute_yays`, `dispute_nays` | Emitted per dispute when votes are tallied |
| `consensus.accept.apply` | `tx.included` | `xrpl.tx.id` | Emitted per transaction included in the accepted ledger |
#### Close Time Queries (Tempo TraceQL)
```
# Find rounds where validators disagreed on close time
{ name = "consensus.accept.apply" && span.close_time_correct = false }
# Find consensus failures (moved_on)
{ name = "consensus.accept.apply" && span.consensus_state = "moved_on" }
# Find slow ledger applications (>5s)
{ name = "consensus.accept.apply" && duration > 5s }
# Find a specific ledger's consensus details
{ name = "consensus.accept.apply" && span.xrpl.ledger.seq = 92345678 }
# Find the trace for a consensus round (deterministic trace strategy)
{ name = "consensus.round" && span.xrpl.consensus.round_id = "<round_id>" }
# Find dispute resolutions
{ name = "consensus.update_positions" && event:name = "dispute.resolve" }
```
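The queries above can also be run outside Grafana. The following is a minimal sketch, assuming Tempo is reachable at `http://localhost:3200` and that `curl` and `jq` are installed; the helper names `span_filter` and `tempo_search` are illustrative. It builds a filter with the attribute scoped as `span.<attr>` inside the braces and sends it to Tempo's `/api/search` endpoint.

```bash
#!/usr/bin/env bash
TEMPO="${TEMPO:-http://localhost:3200}"   # assumed Tempo address

# Build a spanset filter for one span name and one span attribute.
# Pass string values with their quotes, e.g. '"moved_on"'.
span_filter() {
    local name="$1" attr="$2" value="$3"
    printf '{ name = "%s" && span.%s = %s }' "$name" "$attr" "$value"
}

# Run a TraceQL query via Tempo's search endpoint and print matching trace IDs.
tempo_search() {
    local query="$1" limit="${2:-20}"
    curl -sfG "$TEMPO/api/search" \
        --data-urlencode "q=$query" \
        --data-urlencode "limit=$limit" |
        jq -r '.traces[]?.traceID'
}
```

A call such as `tempo_search "$(span_filter consensus.accept.apply consensus_state '"moved_on"')"` should list the trace IDs of rounds that moved on; `--data-urlencode` takes care of escaping the braces and quotes.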
### Ledger Spans (Phase 6)
| Span Name | Source File | Attributes | Description |
| ----------------- | -------------------- | ------------------------------------------ | ----------------------------- |
| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Ledger build during consensus |
| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `validations` | Ledger promoted to validated |
| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history |
### Peer Spans (Phase 6)
| Span Name | Source File | Attributes | Description |
| ------------------------- | ---------------- | ------------------------------------ | ----------------------------- |
| `peer.proposal.receive` | PeerImp.cpp:1667 | `xrpl.peer.id`, `proposal_trusted` | Proposal received from peer |
| `peer.validation.receive` | PeerImp.cpp:2264 | `xrpl.peer.id`, `validation_trusted` | Validation received from peer |
---
## Insights and Sample Queries
This section shows what questions you can answer using the span attributes, with example Tempo TraceQL queries.
### Transaction Workflow Analysis
```
# Find all AMM transactions (AMMDeposit, AMMWithdraw, AMMCreate, etc.)
{ name = "tx.process" && span.tx_type =~ "AMM.*" }
# Find Payment transactions that failed
{ name = "tx.process" && span.tx_type = "Payment" && span.ter_result != "tesSUCCESS" }
# Compare latency of different transaction types (run each query and compare durations)
{ name = "tx.process" && span.tx_type = "OfferCreate" }
{ name = "tx.process" && span.tx_type = "Payment" }
# Find high-fee transactions (fee > 1 XRP = 1000000 drops)
{ name = "tx.process" && span.fee > 1000000 }
# Find transactions that were not applied
{ name = "tx.process" && span.applied = false }
# Trace a specific transaction type across the network
{ name =~ "tx\\..*" && span.tx_type = "NFTokenMint" }
```
### Transaction Queue Health
```
# Find transactions rejected from the queue
{ name = "txq.accept_tx" && span.txq_status = "failed" }
# Which transaction types get queued most often? (run once per type and compare result counts)
{ name = "txq.enqueue" && span.tx_type = "Payment" }
{ name = "txq.enqueue" && span.tx_type = "OfferCreate" }
# Find ledger closes that applied queued transactions
{ name = "txq.accept" && span.ledger_changed = true }
# Find transactions that exhausted retries
{ name = "txq.accept_tx" && span.txq_status = "retried" && span.retries_remaining = 0 }
```
### RPC Debugging
```
# Find batch RPC requests
{ name = "rpc.process" && span.is_batch = true }
# Find large RPC payloads (>100KB)
{ name = "rpc.http_request" && span.request_payload_size > 100000 }
# Find resource-heavy RPC commands (by load_type)
{ name =~ "rpc.command.*" && span.load_type = "exception_rpc" }
# Find a specific WebSocket command
{ name = "rpc.ws_message" && span.command = "subscribe" }
# Find pathfinding requests with many source assets (>10)
{ name = "pathfind.discover" && span.pathfind_num_source_assets > 10 }
```
### PathFinding Performance
```
# Find pathfinding for specific currencies
{ name = "pathfind.compute" && span.pathfind_dest_currency = "USD" }
# Find expensive pathfinding (many source assets to explore)
{ name = "pathfind.discover" && span.pathfind_num_source_assets > 20 }
# Find slow pathfinding requests (>1s)
{ name = "pathfind.compute" && duration > 1s }
```
### Consensus Health
```
# Find rounds where consensus timed out (expired)
{ name = "consensus.accept" && span.consensus_state = "expired" }
# Find rounds where we moved on without full agreement
{ name = "consensus.accept" && span.consensus_state = "moved_on" }
# Find rounds with many disputes
{ name = "consensus.accept" && span.disputes_count > 5 }
# Find bow-out proposals (node resigned from round)
{ name = "consensus.proposal.send" && span.is_bow_out = true }
# Correlate validation with its ledger
{ name = "consensus.validation.send" && span.ledger_hash = "<hash>" }
# Find rounds where validators disagreed on close time
{ name = "consensus.accept.apply" && span.close_time_correct = false }
```
### Cross-Subsystem Correlation
```
# Find slow Payment spans anywhere between receive, queue, and ledger (>500ms)
{ name =~ "tx\\..*|txq\\..*" && span.tx_type = "Payment" && duration > 500ms }
# Find all NFT-related activity
{ name =~ "tx\\..*|txq\\..*" && span.tx_type =~ "NFToken.*" }
# Find slow consensus rounds (>5s)
{ name = "consensus.accept" && span.round_time_ms > 5000 }
```
### Where to Look (Quick Reference)
| Question | Span | Key Attributes |
| ----------------------------------- | --------------------------- | ------------------------------ |
| "Which tx type is slowest?" | `tx.process` | `tx_type` + duration |
| "Why was my tx rejected?" | `tx.process` | `ter_result`, `applied` |
| "Is the TxQ backing up?" | `txq.accept` | `queue_size`, `ledger_changed` |
| "Why was my tx dropped from queue?" | `txq.accept_tx` | `txq_status`, `ter_code` |
| "Are batch requests a problem?" | `rpc.process` | `is_batch`, `batch_size` |
| "Which RPC is expensive?" | `rpc.command.*` | `load_type`, duration |
| "Did consensus stall?" | `consensus.check` | `consensus_stalled` |
| "Was consensus outcome normal?" | `consensus.accept` | `consensus_state` |
| "Did a validator bow out?" | `consensus.proposal.send` | `is_bow_out` |
| "Which ledger was validated?" | `consensus.validation.send` | `ledger_hash` |
---
## Cross-Node Trace Propagation
xrpld propagates trace context across nodes via protobuf `TraceContext` fields
embedded in peer-to-peer messages. When Node A sends a transaction, proposal,
or validation, it injects its active span's trace/span IDs into the protobuf
message. Node B extracts that context on receipt and creates a child span,
linking the two nodes into a single distributed trace.
### How It Works
```
Node A (sender) Node B (receiver)
+-----------------------------+ +-------------------------------+
| tx.process / consensus.* | | PeerImp::onMessage() |
| | | | | |
| v | | v |
| SpanGuard::getTraceBytes() | | extract TraceContext from |
| | | | protobuf message |
| v | send | | |
| injectSpanContext() --------|--------->| v |
| sets TraceContext fields | proto | txReceiveSpan() |
| (trace_id, span_id, flags) | msg | proposalReceiveSpan() |
+-----------------------------+ | validationReceiveSpan() |
| | |
| v |
| child span with parent link |
+-------------------------------+
```
### Send-Side Injection
| Message Type | Injection Point | Mechanism |
| ------------- | -------------------------- | ------------------------------------------ |
| TMTransaction | `NetworkOPs::apply()` | Injects `tx.process` span into relay msg |
| TMProposeSet | `RCLConsensus::propose()` | Injects active context into proposal msg |
| TMValidation | `RCLConsensus::validate()` | Injects active context into validation msg |
### Receive-Side Extraction
| Message Type | Extraction Point | Helper Function |
| ------------- | ----------------------------------- | -------------------------------------------------- |
| TMTransaction | `PeerImp::onMessage(TMTransaction)` | `TxTracing::txReceiveSpan()` |
| TMProposeSet | `PeerImp::onMessage(TMProposeSet)` | `ConsensusReceiveTracing::proposalReceiveSpan()` |
| TMValidation | `PeerImp::onMessage(TMValidation)` | `ConsensusReceiveTracing::validationReceiveSpan()` |
### Key Files
| File | Role |
| ------------------------------------------------- | ----------------------------------------------- |
| `src/xrpld/telemetry/PropagationHelpers.h` | `injectSpanContext()` — SpanGuard to protobuf |
| `include/xrpl/telemetry/TraceContextPropagator.h` | OTel context <-> protobuf conversion primitives |
| `src/xrpld/telemetry/ConsensusReceiveTracing.h` | Proposal/validation receive span factories |
| `src/xrpld/telemetry/TxTracing.h` | Transaction receive span factory |
### Backwards Compatibility
Older peers that do not populate `TraceContext` fields in their messages yield
empty trace bytes on the receive side. The extraction helpers detect this and
create standalone (root) spans instead of child spans. No errors are logged and
no data is lost: the receive span is still created with all its normal
attributes and only lacks a cross-node parent link.
### Example Tempo Queries
```
# Find cross-node transaction traces (tx.process -> tx.receive across nodes)
{ name = "tx.receive" && status != error }
# Find proposals received with cross-node parent context
{ name = "consensus.proposal.receive" && nestedSetParent > 0 }
# Trace a transaction across the network by its hash
{ name =~ "tx\\..*" && span.xrpl.tx.hash = "<hash>" }
# Find all spans in a cross-node consensus trace
{ rootServiceName = "xrpld" && span.xrpl.consensus.round_id = "<round_id>" }
# Compare latency between sender and receiver for validations
{ name = "consensus.validation.send" || name = "consensus.validation.receive" }
```
## Prometheus Metrics (Spanmetrics)
The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in xrpld.
### Generated Metric Names
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ---------------------------- |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution buckets |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Latency observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
### Metric Labels
Every metric carries these standard labels:
| Label | Source | Example |
| -------------- | ------------------ | ---------------------------------------- |
| `span_name` | Span name | `rpc.command.server_info` |
| `status_code` | Span status | `STATUS_CODE_UNSET`, `STATUS_CODE_ERROR` |
| `service_name` | Resource attribute | `xrpld` |
| `span_kind` | Span kind | `SPAN_KIND_INTERNAL` |
Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):
| Span Attribute | Metric Label | Applies To |
| --------------------- | ------------------------------ | ------------------------------- |
| `command` | `xrpl_rpc_command` | `rpc.command.*` spans |
| `rpc_status` | `xrpl_rpc_status` | `rpc.command.*` spans |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
| `local` | `xrpl_tx_local` | `tx.process` spans |
| `proposal_trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` spans |
| `validation_trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` spans |
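To make the dots-to-underscores rule concrete, here is a small sketch that converts an attribute name and builds a selector on `traces_span_metrics_calls_total`. The helper names `label_name` and `calls_selector` are hypothetical, and the label value `proposing` is only an example; the sketch applies nothing beyond the stated dot replacement.

```bash
#!/usr/bin/env bash
# Convert a span attribute name to a Prometheus label name.
# Only the dot-to-underscore rule stated above is applied here.
label_name() {
    printf '%s' "${1//./_}"
}

# Build a selector for spanmetrics call counts filtered on one dimension.
calls_selector() {
    printf 'traces_span_metrics_calls_total{%s="%s"}' "$(label_name "$1")" "$2"
}
```

For instance, `calls_selector xrpl.consensus.mode proposing` produces a selector on the `xrpl_consensus_mode` label that can be pasted into the Prometheus expression browser.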
### Histogram Buckets
Configured in `otel-collector-config.yaml`:
```
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```
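These buckets are what `histogram_quantile` interpolates over. The sketch below, assuming Prometheus at `http://localhost:9090` with `curl` and `jq` available, builds a p95 expression per span name and evaluates it through the HTTP query API. The helper names `p95_expr` and `prom_query` are illustrative.

```bash
#!/usr/bin/env bash
PROM="${PROM:-http://localhost:9090}"   # assumed Prometheus address

# Build a PromQL expression for the p95 latency (ms) of spans whose name
# matches a regex, grouped by span name.
p95_expr() {
    local span_regex="$1" window="${2:-5m}"
    printf 'histogram_quantile(0.95, sum by (le, span_name) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"%s"}[%s])))' \
        "$span_regex" "$window"
}

# Evaluate an instant query and print "<span_name> <value>" per series.
prom_query() {
    curl -sfG "$PROM/api/v1/query" --data-urlencode "query=$1" |
        jq -r '.data.result[] | "\(.metric.span_name) \(.value[1])"'
}
```

Running `prom_query "$(p95_expr 'rpc.command.*')"` prints one line per RPC command span. Because the largest finite bucket is 5s, any quantile that falls in the overflow bucket is reported as 5000 ms, so slower operations need wider buckets to be measured accurately.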
## StatsD Metrics (beast::insight)
xrpld has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.
### Configuration
Add to `xrpld.cfg`:
```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=xrpld
```
The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics.
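To check the StatsD path independently of xrpld, a datagram can be sent by hand. This is a minimal sketch that relies on bash's `/dev/udp` redirection; the helper names and the metric name `xrpld.smoke_test` are made up for illustration, and the expectation that it surfaces in Prometheus as `xrpld_smoke_test` is an assumption about the exporter's name sanitization.

```bash
#!/usr/bin/env bash
# Format one StatsD datagram: <name>:<value>|<type>
# type is c (counter), g (gauge), or ms (timer).
statsd_line() {
    printf '%s:%s|%s' "$1" "$2" "$3"
}

# Send a datagram over UDP using bash's /dev/udp pseudo-device.
statsd_send() {
    local host="${STATSD_HOST:-127.0.0.1}" port="${STATSD_PORT:-8125}"
    statsd_line "$@" >"/dev/udp/$host/$port"
}
```

After `statsd_send xrpld.smoke_test 1 c`, allow for the receiver's aggregation interval plus one Prometheus scrape before looking for the series.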
### Metric Reference
#### Gauges
| Prometheus Metric | Source | Description |
| ------------------------------------------- | ------------------------- | -------------------------------------------------------------------------- |
| `xrpld_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h:373 | Age of validated ledger (seconds) |
| `xrpld_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h:374 | Age of published ledger (seconds) |
| `xrpld_State_Accounting_{Mode}_duration` | NetworkOPs.cpp:774 | Time in each operating mode (Disconnected/Connected/Syncing/Tracking/Full) |
| `xrpld_State_Accounting_{Mode}_transitions` | NetworkOPs.cpp:780 | Transition count per mode |
| `xrpld_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp:214 | Active inbound peer connections |
| `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp:215 | Active outbound peer connections |
| `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.h:557 | Peer disconnect count |
| `xrpld_job_count` | JobQueue.cpp:26 | Current job queue depth |
| `xrpld_{category}_Bytes_In/Out` | OverlayImpl.h:535 | Overlay traffic bytes per category (57 categories) |
| `xrpld_{category}_Messages_In/Out` | OverlayImpl.h:535 | Overlay traffic messages per category |
#### Counters
| Prometheus Metric | Source | Description |
| ------------------------------- | --------------------- | ------------------------------ |
| `xrpld_rpc_requests` | ServerHandler.cpp:108 | Total RPC request count |
| `xrpld_ledger_fetches` | InboundLedgers.cpp:44 | Ledger fetch request count |
| `xrpld_ledger_history_mismatch` | LedgerHistory.cpp:16 | Ledger hash mismatch count |
| `xrpld_warn` | Logic.h:33 | Resource manager warning count |
| `xrpld_drop` | Logic.h:34 | Resource manager drop count |
#### Histograms (from StatsD timers)
| Prometheus Metric | Source | Description |
| --------------------- | --------------------- | ------------------------------ |
| `xrpld_rpc_time` | ServerHandler.cpp:110 | RPC response time (ms) |
| `xrpld_rpc_size` | ServerHandler.cpp:109 | RPC response size (bytes) |
| `xrpld_ios_latency` | Application.cpp:438 | I/O service loop latency (ms) |
| `xrpld_pathfind_fast` | PathRequests.h:23 | Fast pathfinding duration (ms) |
| `xrpld_pathfind_full` | PathRequests.h:24 | Full pathfinding duration (ms) |
## Grafana Dashboards
Ten dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
### RPC Performance (`xrpld-rpc-perf`)
| Panel | Type | PromQL | Labels Used |
| --------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| RPC Request Rate by Command | timeseries | `sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))` | `xrpl_rpc_command` |
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |
| Overall RPC Throughput | timeseries | `rpc.request` + `rpc.process` rate | — |
| RPC Success vs Error | timeseries | by `status_code` (UNSET vs ERROR) | `status_code` |
| Top Commands by Volume | bargauge | `topk(10, ...)` by `xrpl_rpc_command` | `xrpl_rpc_command` |
| WebSocket Message Rate | stat | `rpc.ws_message` rate | — |
### Transaction Overview (`xrpld-transactions`)
| Panel | Type | PromQL | Labels Used |
| --------------------------------- | ---------- | -------------------------------------------------------------------------------------------- | --------------- |
| Transaction Processing Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])` and `tx.receive` | `span_name` |
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95, ... {span_name="tx.process"})` and the same query with `0.50` | — |
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | — |
| TX Processing Duration Heatmap | heatmap | `tx.process` histogram buckets | `le` |
| TX Apply Duration per Ledger | timeseries | p95/p50 of `tx.apply` | — |
| Peer TX Receive Rate | timeseries | `tx.receive` rate | — |
| TX Apply Failed Rate | stat | `tx.apply` with `STATUS_CODE_ERROR` | `status_code` |
### Consensus Health (`xrpld-consensus`)
| Panel | Type | PromQL | Labels Used |
| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | --------------------- |
| Consensus Round Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.accept"})` and the same query with `0.50` | — |
| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | — |
| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | — |
| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | — |
| Ledger Apply Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.accept.apply"})` and the same query with `0.50` | — |
| Close Time Agreement | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.accept.apply"}[5m])` | — |
| Consensus Mode Over Time | timeseries | `consensus.ledger_close` by `xrpl_consensus_mode` | `xrpl_consensus_mode` |
| Accept vs Close Rate | timeseries | `consensus.accept` vs `consensus.ledger_close` rate | — |
| Validation vs Close Rate | timeseries | `consensus.validation.send` vs `consensus.ledger_close` | — |
| Accept Duration Heatmap | heatmap | `consensus.accept` histogram buckets | `le` |
### Ledger Operations (`xrpld-ledger-ops`)
| Panel | Type | PromQL | Labels Used |
| ----------------------- | ---------- | ---------------------------------------------- | ----------- |
| Ledger Build Rate | stat | `ledger.build` call rate | — |
| Ledger Build Duration | timeseries | p95/p50 of `ledger.build` | — |
| Ledger Validation Rate | stat | `ledger.validate` call rate | — |
| Build Duration Heatmap | heatmap | `ledger.build` histogram buckets | `le` |
| TX Apply Duration | timeseries | p95/p50 of `tx.apply` | — |
| TX Apply Rate | timeseries | `tx.apply` call rate | — |
| Ledger Store Rate | stat | `ledger.store` call rate | — |
| Build vs Close Duration | timeseries | p95 `ledger.build` vs `consensus.ledger_close` | — |
### Peer Network (`xrpld-peer-net`)
Requires `trace_peer=1` in the `[telemetry]` config section.
| Panel | Type | PromQL | Labels Used |
| -------------------------------- | ---------- | --------------------------------- | ------------------------------ |
| Proposal Receive Rate | timeseries | `peer.proposal.receive` rate | — |
| Validation Receive Rate | timeseries | `peer.validation.receive` rate | — |
| Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` |
| Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |
### Node Health -- StatsD (`xrpld-statsd-node-health`)
| Panel | Type | PromQL | Labels Used |
| -------------------------------------- | ---------- | --------------------------------------------------------------- | ----------- |
| Validated Ledger Age | stat | `xrpld_LedgerMaster_Validated_Ledger_Age` | — |
| Published Ledger Age | stat | `xrpld_LedgerMaster_Published_Ledger_Age` | — |
| Operating Mode Duration | timeseries | `xrpld_State_Accounting_*_duration` | — |
| Operating Mode Transitions | timeseries | `xrpld_State_Accounting_*_transitions` | — |
| I/O Latency | timeseries | `histogram_quantile(0.95, xrpld_ios_latency_bucket)` | — |
| Job Queue Depth | timeseries | `xrpld_job_count` | — |
| Ledger Fetch Rate | stat | `rate(xrpld_ledger_fetches[5m])` | — |
| Ledger History Mismatches | stat | `rate(xrpld_ledger_history_mismatch[5m])` | — |
| Key Jobs Execution Time | timeseries | `xrpld_acceptLedger{quantile="$quantile"}` (+ 10 more key jobs) | `quantile` |
| Key Jobs Dequeue Wait Time | timeseries | `xrpld_acceptLedger_q{quantile="$quantile"}` (+ 10 more) | `quantile` |
| FullBelowCache Size | timeseries | `xrpld_Node_family_full_below_cache_size` | — |
| FullBelowCache Hit Rate | gauge | `xrpld_Node_family_full_below_cache_hit_rate` | — |
| Ledger Publish Gap | stat | `Published_Ledger_Age - Validated_Ledger_Age` | — |
| State Duration Rate (Full vs Tracking) | timeseries | `rate(xrpld_State_Accounting_Full_duration[5m]) / 1000000` | — |
| All Jobs Execution Time (Detail) | timeseries | `{__name__=~"xrpld_<all_jobs>", quantile="$quantile"}` | `quantile` |
| All Jobs Dequeue Wait (Detail) | timeseries | `{__name__=~"xrpld_<all_jobs>_q", quantile="$quantile"}` | `quantile` |
### Network Traffic -- StatsD (`xrpld-statsd-network`)
| Panel | Type | PromQL | Labels Used |
| ------------------------------------ | ---------- | ------------------------------------------ | ----------- |
| Active Peers | timeseries | `xrpld_Peer_Finder_Active_*_Peers` | — |
| Peer Disconnects | timeseries | `xrpld_Overlay_Peer_Disconnects` | — |
| Total Network Bytes | timeseries | `rate(xrpld_total_Bytes_In/Out[5m])` | — |
| Total Network Messages | timeseries | `xrpld_total_Messages_In/Out` | — |
| Transaction Traffic | timeseries | `xrpld_transactions_Messages_In/Out` | — |
| Proposal Traffic | timeseries | `xrpld_proposals_Messages_In/Out` | — |
| Validation Traffic | timeseries | `xrpld_validations_Messages_In/Out` | — |
| Traffic by Category | bargauge | `topk(10, xrpld_*_Bytes_In)` | — |
| Duplicate Traffic (Wasted Bandwidth) | timeseries | `rate(xrpld_*_duplicate_Bytes_In/Out[5m])` | — |
| All Traffic Categories (Detail) | timeseries | `topk(15, rate(xrpld_*_Bytes_In[5m]))` | — |
### RPC & Pathfinding -- StatsD (`xrpld-statsd-rpc`)
| Panel | Type | PromQL | Labels Used |
| ------------------------- | ---------- | ------------------------------------------------------ | ----------- |
| RPC Request Rate | stat | `rate(xrpld_rpc_requests[5m])` | — |
| RPC Response Time | timeseries | `histogram_quantile(0.95, xrpld_rpc_time_bucket)` | — |
| RPC Response Size | timeseries | `histogram_quantile(0.95, xrpld_rpc_size_bucket)` | — |
| RPC Response Time Heatmap | heatmap | `xrpld_rpc_time_bucket` | — |
| Pathfinding Fast Duration | timeseries | `histogram_quantile(0.95, xrpld_pathfind_fast_bucket)` | — |
| Pathfinding Full Duration | timeseries | `histogram_quantile(0.95, xrpld_pathfind_full_bucket)` | — |
| Resource Warnings Rate | stat | `rate(xrpld_warn[5m])` | — |
| Resource Drops Rate | stat | `rate(xrpld_drop[5m])` | — |
### Span → Metric → Dashboard Summary
| Span Name | Prometheus Metric Filter | Grafana Dashboard |
| ------------------------------ | -------------------------------------------- | --------------------------------------------- |
| `rpc.http_request` | `{span_name="rpc.http_request"}` | RPC Performance (Overall Throughput) |
| `rpc.ws_upgrade` | `{span_name="rpc.ws_upgrade"}` | -- (available but not paneled) |
| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |
| `tx.apply` | `{span_name="tx.apply"}` | Transaction Overview + Ledger Ops (Apply) |
| `txq.enqueue` | `{span_name="txq.enqueue"}` | -- (available but not paneled) |
| `txq.apply_direct` | `{span_name="txq.apply_direct"}` | -- (available but not paneled) |
| `txq.batch_clear` | `{span_name="txq.batch_clear"}` | -- (available but not paneled) |
| `txq.accept` | `{span_name="txq.accept"}` | -- (available but not paneled) |
| `txq.accept_tx` | `{span_name="txq.accept_tx"}` | -- (available but not paneled) |
| `txq.cleanup` | `{span_name="txq.cleanup"}` | -- (available but not paneled) |
| `consensus.round` | `{span_name="consensus.round"}` | -- (available but not paneled) |
| `consensus.phase.open` | `{span_name="consensus.phase.open"}` | -- (available but not paneled) |
| `consensus.establish` | `{span_name="consensus.establish"}` | -- (available but not paneled) |
| `consensus.update_positions` | `{span_name="consensus.update_positions"}` | -- (available but not paneled) |
| `consensus.check` | `{span_name="consensus.check"}` | -- (available but not paneled) |
| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Duration, Rate, Heatmap) |
| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close, Mode) |
| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
| `consensus.accept.apply` | `{span_name="consensus.accept.apply"}` | Consensus Health (Apply Duration, Close Time) |
| `consensus.mode_change` | `{span_name="consensus.mode_change"}` | -- (available but not paneled) |
| `consensus.proposal.receive` | `{span_name="consensus.proposal.receive"}` | -- (available but not paneled) |
| `consensus.validation.receive` | `{span_name="consensus.validation.receive"}` | -- (available but not paneled) |
| `ledger.build` | `{span_name="ledger.build"}` | Ledger Ops (Build Rate, Duration, Heatmap) |
| `ledger.validate` | `{span_name="ledger.validate"}` | Ledger Ops (Validation Rate) |
| `ledger.store` | `{span_name="ledger.store"}` | Ledger Ops (Store Rate) |
| `peer.proposal.receive` | `{span_name="peer.proposal.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
| `peer.validation.receive` | `{span_name="peer.validation.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
## Troubleshooting
### No traces appearing in Tempo
1. Check xrpld logs for `Telemetry starting` message
2. Verify `enabled=1` in the `[telemetry]` config section
3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces` (any HTTP response, even an error status, confirms the port is reachable)
4. Check collector logs: `docker compose -f docker/telemetry/docker-compose.yml logs otel-collector`
5. Verify Tempo is receiving data: open Grafana → Explore → select Tempo datasource → search by `service.name = xrpld`
6. Check Tempo logs: `docker compose -f docker/telemetry/docker-compose.yml logs tempo`
### High memory usage
- Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)
- Reduce `max_queue_size` and `batch_size`
- Disable high-volume trace categories: `trace_peer=0`
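To see why lowering `sampling_ratio` cuts span volume, ratio-based sampling can be pictured as a deterministic threshold on the trace ID. The function below is an illustration under that assumption only; `demoShouldSample` is a hypothetical name, and this is not claimed to be the exact algorithm used by xrpld or the OpenTelemetry SDK.

```cpp
#include <array>
#include <cstdint>
#include <limits>

// Decide whether a trace is kept, given a ratio in [0, 1].
// The first 8 bytes of the 16-byte trace ID are read as a big-endian
// integer and compared against ratio * 2^64, so the same trace ID always
// yields the same decision on every node.
inline bool
demoShouldSample(std::array<std::uint8_t, 16> const& traceId, double ratio)
{
    if (ratio <= 0.0)
        return false;
    if (ratio >= 1.0)
        return true;

    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i)
        value = (value << 8) | traceId[i];

    auto const threshold = static_cast<std::uint64_t>(
        ratio *
        static_cast<double>(std::numeric_limits<std::uint64_t>::max()));
    return value < threshold;
}
```

With `ratio = 0.1`, roughly one in ten uniformly distributed trace IDs falls below the threshold, which is where the proportional reduction in queued and exported spans comes from.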
### Collector connection failures
- Verify endpoint URL matches collector address
- Check firewall rules for ports 4317 (OTLP/gRPC) and 4318 (OTLP/HTTP)
- If using TLS, verify certificate path with `tls_ca_cert`
## Performance Tuning
| Scenario | Recommendation |
| ------------------------ | ------------------------------------------------- |
| Production mainnet | `sampling_ratio=0.01`, `trace_peer=0` |
| Testnet/devnet | `sampling_ratio=1.0` (full tracing) |
| Debugging specific issue | `sampling_ratio=1.0` temporarily |
| High-throughput node | Increase `batch_size=1024`, `max_queue_size=4096` |
## Disabling Telemetry
Set `enabled=0` in the `[telemetry]` config section to disable telemetry at runtime, or compile it out entirely by building with the flag turned off:
```bash
cmake --preset default -Dtelemetry=OFF
```
When telemetry is compiled out, all trace macros expand to no-ops with zero overhead.
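The compile-out behaviour can be pictured with a small stand-alone example. `DEMO_TELEMETRY_ENABLED`, `DEMO_TRACE_SPAN`, and `demoRecordedSpans` are hypothetical names chosen for this sketch; the real xrpld macros and build definitions may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical build switch; a real build system would define it.
#ifndef DEMO_TELEMETRY_ENABLED
#define DEMO_TELEMETRY_ENABLED 0
#endif

// Stand-in for a tracing backend: just remembers span names.
inline std::vector<std::string>&
demoRecordedSpans()
{
    static std::vector<std::string> spans;
    return spans;
}

#if DEMO_TELEMETRY_ENABLED
// Enabled build: the macro records the span.
#define DEMO_TRACE_SPAN(name) demoRecordedSpans().emplace_back(name)
#else
// Disabled build: the macro expands to an empty expression, so the
// argument is never evaluated and no code is generated for it.
#define DEMO_TRACE_SPAN(name) ((void)0)
#endif

inline int
demoHandleRequest(int x)
{
    DEMO_TRACE_SPAN("rpc.process");
    return x * 2;
}
```

In the disabled configuration the call site compiles to the same code as if the macro line were absent, which is what makes the compile-time switch cheaper than the runtime `enabled=0` check.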

@@ -102,6 +102,10 @@ message TraceContext {
optional bytes trace_id = 1; // 16-byte trace identifier
optional bytes span_id = 2; // 8-byte parent span identifier
optional uint32 trace_flags = 3; // bit 0 = sampled
// TODO: trace_state is reserved for W3C tracestate vendor-specific
// key-value pairs but is not yet read or written by
// TraceContextPropagator. Wire it when cross-vendor trace
// propagation is needed.
optional string trace_state = 4; // RESERVED — see TraceContext header note
}

@@ -125,6 +125,7 @@ inline constexpr auto ledgerSeq = makeStr("ledger_seq");
inline constexpr auto closeTime = makeStr("close_time");
inline constexpr auto closeTimeCorrect = makeStr("close_time_correct");
inline constexpr auto closeResolutionMs = makeStr("close_resolution_ms");
inline constexpr auto ledgerHash = join(join(seg::xrpl, seg::ledger), makeStr("hash"));
} // namespace attr
// ===== Shared attribute values =============================================

@@ -166,7 +166,7 @@ private:
std::string name_;
GaugeImpl::value_type lastValue_{0};
GaugeImpl::value_type value_{0};
bool dirty_{false};
bool dirty_{true};
};
//------------------------------------------------------------------------------
@@ -580,6 +580,9 @@ StatsDEventImpl::doNotify(EventImpl::value_type const& value)
StatsDGaugeImpl::StatsDGaugeImpl(std::string name, std::shared_ptr<StatsDCollectorImp> const& impl)
: impl_(impl), name_(std::move(name))
{
// Start dirty so the initial value (0) is emitted on the first flush.
// Without this, gauges whose value never changes from 0 would never
// appear in downstream metric stores (e.g. Prometheus via StatsD).
impl_->add(*this);
}
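
The effect of starting with `dirty_{true}` can be illustrated with a small stand-alone gauge. `DemoGauge` and its members are hypothetical and only mirror the idea described in the comment above; this is a sketch, not the `StatsDGaugeImpl` code.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// A gauge that is only written out when its value changed since the
// previous flush. Starting dirty forces one initial emission of 0.
class DemoGauge
{
public:
    explicit DemoGauge(std::string name) : name_(std::move(name))
    {
    }

    void
    set(std::uint64_t value)
    {
        if (value != value_)
        {
            value_ = value;
            dirty_ = true;
        }
    }

    // Append a StatsD gauge line to `out` if there is something to report.
    void
    flush(std::vector<std::string>& out)
    {
        if (!dirty_)
            return;
        dirty_ = false;
        out.push_back(name_ + ":" + std::to_string(value_) + "|g");
    }

private:
    std::string name_;
    std::uint64_t value_{0};
    bool dirty_{true};  // with `false`, a gauge stuck at 0 never flushes
};
```

A gauge that stays at 0 for the whole process lifetime is therefore emitted exactly once, which is enough for the series to exist downstream.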

@@ -2,6 +2,7 @@
#include <xrpld/app/ledger/LedgerReplay.h>
#include <xrpld/app/ledger/OpenLedger.h>
#include <xrpld/app/ledger/detail/LedgerSpanNames.h>
#include <xrpld/app/main/Application.h>
#include <xrpl/basics/Log.h>
@@ -17,9 +18,13 @@
#include <xrpl/protocol/LedgerHeader.h>
#include <xrpl/protocol/Protocol.h>
#include <xrpl/protocol/SystemParameters.h>
#include <xrpl/telemetry/SpanGuard.h>
#include <xrpl/telemetry/SpanNames.h>
#include <xrpl/tx/apply.h>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <memory>
#include <set>
@@ -43,6 +48,9 @@ buildLedgerImpl(
beast::Journal j,
ApplyTxs&& applyTxs)
{
using namespace telemetry;
auto buildSpan = SpanGuard::span(TraceCategory::Ledger, seg::ledger, ledger_span::op::build);
auto built = std::make_shared<Ledger>(*parent, closeTime);
if (built->isFlagLedger())
@@ -76,6 +84,14 @@ buildLedgerImpl(
built->header().seq < kXrpLedgerEarliestFees || built->read(keylet::fees()),
"xrpl::buildLedgerImpl : valid ledger fees");
built->setAccepted(closeTime, closeResolution, closeTimeCorrect);
buildSpan.setAttribute(ledger_span::attr::ledgerSeq, static_cast<int64_t>(built->header().seq));
buildSpan.setAttribute(
ledger_span::attr::closeTime, static_cast<int64_t>(closeTime.time_since_epoch().count()));
buildSpan.setAttribute(ledger_span::attr::closeTimeCorrect, closeTimeCorrect);
buildSpan.setAttribute(
ledger_span::attr::closeResolutionMs,
static_cast<int64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(closeResolution).count()));
return built;
}
@@ -99,6 +115,9 @@ applyTransactions(
OpenView& view,
beast::Journal j)
{
using namespace telemetry;
auto applySpan = SpanGuard::span(TraceCategory::Transactions, seg::tx, ledger_span::op::apply);
bool certainRetry = true;
std::size_t count = 0;
@@ -165,6 +184,8 @@ applyTransactions(
// If there are any transactions left, we must have
// tried them in at least one final pass
XRPL_ASSERT(txns.empty() || !certainRetry, "xrpl::applyTransactions : retry transactions");
applySpan.setAttribute(ledger_span::attr::txCount, static_cast<int64_t>(count));
applySpan.setAttribute(ledger_span::attr::txFailed, static_cast<int64_t>(failed.size()));
return count;
}

@@ -6,6 +6,7 @@
#include <xrpld/app/ledger/LedgerReplay.h>
#include <xrpld/app/ledger/LedgerReplayer.h>
#include <xrpld/app/ledger/OpenLedger.h>
#include <xrpld/app/ledger/detail/LedgerSpanNames.h>
#include <xrpld/app/main/Application.h>
#include <xrpld/app/misc/SHAMapStore.h>
#include <xrpld/app/misc/Transaction.h>
@@ -55,6 +56,8 @@
#include <xrpl/shamap/SHAMap.h>
#include <xrpl/shamap/SHAMapMissingNode.h>
#include <xrpl/shamap/SHAMapTreeNode.h>
#include <xrpl/telemetry/SpanGuard.h>
#include <xrpl/telemetry/SpanNames.h>
#include <boost/icl/concept/interval_set.hpp>
@@ -449,6 +452,10 @@ LedgerMaster::fixIndex(LedgerIndex ledgerIndex, LedgerHash const& ledgerHash)
bool
LedgerMaster::storeLedger(std::shared_ptr<Ledger const> ledger)
{
using namespace telemetry;
auto span = SpanGuard::span(TraceCategory::Ledger, seg::ledger, ledger_span::op::store);
span.setAttribute(ledger_span::attr::ledgerSeq, static_cast<int64_t>(ledger->header().seq));
bool const validated = ledger->header().validated;
// Returns true if we already had the ledger
return ledgerHistory_.insert(ledger, validated);
@@ -965,6 +972,11 @@ LedgerMaster::checkAccept(std::shared_ptr<Ledger const> const& ledger)
return;
}
using namespace telemetry;
auto valSpan = SpanGuard::span(TraceCategory::Ledger, seg::ledger, ledger_span::op::validate);
valSpan.setAttribute(ledger_span::attr::ledgerSeq, static_cast<int64_t>(ledger->header().seq));
valSpan.setAttribute(ledger_span::attr::validations, static_cast<int64_t>(tvc));
JLOG(journal_.info()) << "Advancing accepted ledger to " << ledger->header().seq
<< " with >= " << minVal << " validations";

@@ -0,0 +1,45 @@
#pragma once
/** Compile-time span name constants for ledger tracing.
*
* Used by BuildLedger and LedgerMaster for ledger lifecycle spans.
* Built on StaticStr/join() from SpanNames.h.
*
* Span hierarchy:
*
* ledger.build (BuildLedger — ledger construction)
* ledger.store (LedgerMaster — ledger storage)
* ledger.validate (LedgerMaster — ledger validation acceptance)
* tx.apply (BuildLedger — transaction application)
*/
#include <xrpl/telemetry/SpanNames.h>
namespace xrpl::telemetry::ledger_span {
// ===== Span operation suffixes ===============================================
namespace op {
inline constexpr auto build = makeStr("build");
inline constexpr auto store = makeStr("store");
inline constexpr auto validate = makeStr("validate");
inline constexpr auto apply = makeStr("apply");
} // namespace op
// ===== Attribute keys ========================================================
namespace attr {
/// Canonical shared constants (defined in SpanNames.h).
using ::xrpl::telemetry::attr::closeResolutionMs;
using ::xrpl::telemetry::attr::closeTime;
using ::xrpl::telemetry::attr::closeTimeCorrect;
using ::xrpl::telemetry::attr::ledgerHash;
using ::xrpl::telemetry::attr::ledgerSeq;
/// Domain-owned bare attrs.
inline constexpr auto txCount = makeStr("tx_count");
inline constexpr auto txFailed = makeStr("tx_failed");
inline constexpr auto validations = makeStr("validations");
} // namespace attr
} // namespace xrpl::telemetry::ledger_span
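
The header relies on `makeStr()` and `join()` from SpanNames.h, which are not shown in this diff. As a rough picture of how compile-time name composition of this kind can work, here is a self-contained sketch. `DemoStr`, `demoMakeStr`, and `demoJoin` are hypothetical stand-ins, and the dot separator is an assumption inferred from span names such as `ledger.build`.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Fixed-size string usable in constant expressions (N chars + NUL).
template <std::size_t N>
struct DemoStr
{
    std::array<char, N + 1> data{};

    constexpr std::string_view
    view() const
    {
        return std::string_view(data.data(), N);
    }
};

// Copy a string literal (including its terminator slot) into a DemoStr.
template <std::size_t N>
constexpr DemoStr<N - 1>
demoMakeStr(char const (&literal)[N])
{
    DemoStr<N - 1> out{};
    for (std::size_t i = 0; i + 1 < N; ++i)
        out.data[i] = literal[i];
    return out;
}

// Concatenate two segments with a '.' between them.
template <std::size_t A, std::size_t B>
constexpr DemoStr<A + B + 1>
demoJoin(DemoStr<A> const& a, DemoStr<B> const& b)
{
    DemoStr<A + B + 1> out{};
    for (std::size_t i = 0; i < A; ++i)
        out.data[i] = a.data[i];
    out.data[A] = '.';
    for (std::size_t i = 0; i < B; ++i)
        out.data[A + 1 + i] = b.data[i];
    return out;
}

inline constexpr auto demoLedger = demoMakeStr("ledger");
inline constexpr auto demoBuild = demoMakeStr("build");
inline constexpr auto demoLedgerBuild = demoJoin(demoLedger, demoBuild);

// The full span name exists at compile time; no runtime concatenation.
static_assert(demoLedgerBuild.view() == "ledger.build");
```

Composing names this way keeps segments such as `ledger` defined in one place while still giving each span a constant, allocation-free name.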

@@ -151,6 +151,7 @@ private:
beast::Journal journal_;
beast::IOLatencyProbe<std::chrono::steady_clock> probe_;
std::atomic<std::chrono::milliseconds> lastSample_;
std::atomic<bool> firstSample_;
public:
IOLatencySampler(
@@ -158,7 +159,7 @@ private:
beast::Journal journal,
std::chrono::milliseconds interval,
boost::asio::io_context& ios)
: event_(std::move(ev)), journal_(journal), probe_(interval, ios)
: event_(std::move(ev)), journal_(journal), probe_(interval, ios), firstSample_(true)
{
}
@@ -177,7 +178,10 @@ private:
lastSample_ = lastSample;
if (lastSample >= 10ms)
// Always emit the first sample so the metric is registered in
// downstream stores (Prometheus via StatsD). After that, only
// report latency >= 10 ms to avoid flooding with sub-ms values.
if (firstSample_.exchange(false) || lastSample >= 10ms)
event_.notify(lastSample);
if (lastSample >= 500ms)
{

@@ -17,6 +17,7 @@
#include <xrpld/overlay/ReduceRelayCommon.h>
#include <xrpld/overlay/detail/Handshake.h>
#include <xrpld/overlay/detail/OverlayImpl.h>
#include <xrpld/overlay/detail/PeerSpanNames.h>
#include <xrpld/overlay/detail/ProtocolMessage.h>
#include <xrpld/overlay/detail/ProtocolVersion.h>
#include <xrpld/overlay/detail/TrafficCount.h>
@@ -68,6 +69,7 @@
#include <xrpl/server/NetworkOPs.h>
#include <xrpl/shamap/SHAMapNodeID.h>
#include <xrpl/telemetry/SpanGuard.h>
#include <xrpl/telemetry/SpanNames.h>
#include <xrpl/tx/apply.h>
#include <boost/algorithm/string/predicate.hpp>
@@ -1767,6 +1769,10 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMLedgerData> const& m)
void
PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
{
using namespace telemetry;
auto span = SpanGuard::span(TraceCategory::Peer, seg::peer, peer_span::op::proposalReceive);
span.setAttribute(peer_span::attr::peerId, static_cast<int64_t>(id_));
protocol::TMProposeSet const& set = *m;
auto const sig = makeSlice(set.signature());
@@ -1793,6 +1799,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
// every time a spam packet is received
PublicKey const publicKey{makeSlice(set.nodepubkey())};
auto const isTrusted = app_.getValidators().trusted(publicKey);
span.setAttribute(peer_span::attr::proposalTrusted, isTrusted);
// If the operator has specified that untrusted proposals be dropped then
// this happens here I.e. before further wasting CPU verifying the signature
@@ -1861,19 +1868,17 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
app_.getTimeKeeper().closeTime(),
calcNodeID(app_.getValidatorManifests().getMasterKey(publicKey))});
// Create a receive span that links to the sender's trace context
// (if propagated). shared_ptr keeps it alive across the job boundary.
auto span = std::make_shared<telemetry::SpanGuard>(telemetry::proposalReceiveSpan(set));
span->setAttribute(telemetry::consensus::span::attr::trusted, isTrusted);
span->setAttribute(
auto consSpan = std::make_shared<telemetry::SpanGuard>(telemetry::proposalReceiveSpan(set));
consSpan->setAttribute(telemetry::consensus::span::attr::trusted, isTrusted);
consSpan->setAttribute(
telemetry::consensus::span::attr::round, static_cast<int64_t>(set.proposeseq()));
// First 16 hex chars (8 bytes) of each hash — enough to disambiguate
// peer positions and prior ledgers without exporting full 32-byte
// hashes on every receive event.
span->setAttribute(
consSpan->setAttribute(
telemetry::consensus::span::attr::prevLedgerPrefix,
to_string(prevLedger).substr(0, 16).c_str());
span->setAttribute(
consSpan->setAttribute(
telemetry::consensus::span::attr::positionHashPrefix,
to_string(proposeHash).substr(0, 16).c_str());
@@ -1881,7 +1886,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
app_.getJobQueue().addJob(
isTrusted ? JtProposalT : JtProposalUt,
"checkPropose",
[weak, isTrusted, m, proposal, sp = std::move(span)]() {
[weak, isTrusted, m, proposal, sp = std::move(consSpan)]() {
if (auto peer = weak.lock())
peer->checkPropose(isTrusted, m, proposal);
});
@@ -2380,6 +2385,11 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidatorListCollection> const& m
void
PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
{
using namespace telemetry;
auto valSpan =
SpanGuard::span(TraceCategory::Peer, seg::peer, peer_span::op::validationReceive);
valSpan.setAttribute(peer_span::attr::peerId, static_cast<int64_t>(id_));
if (m->validation().size() < 50)
{
JLOG(pJournal_.warn()) << "Validation: Too small";
@@ -2402,6 +2412,8 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
false);
val->setSeen(closeTime);
}
valSpan.setAttribute(peer_span::attr::ledgerHash, to_string(val->getLedgerHash()).c_str());
valSpan.setAttribute(peer_span::attr::validationFull, val->isFull());
if (!isCurrent(
app_.getValidations().parms(),
@@ -2418,6 +2430,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
// suppression for 30 seconds to avoid doing a relatively expensive
// lookup every time a spam packet is received
auto const isTrusted = app_.getValidators().trusted(val->getSignerPublic());
valSpan.setAttribute(peer_span::attr::validationTrusted, isTrusted);
// If the operator has specified that untrusted validations be
// dropped then this happens here I.e. before further wasting CPU
@@ -2455,18 +2468,17 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
return;
}
// Create a receive span that links to the sender's trace context
// (if propagated). shared_ptr keeps it alive across the job boundary.
auto span = std::make_shared<telemetry::SpanGuard>(telemetry::validationReceiveSpan(*m));
span->setAttribute(telemetry::consensus::span::attr::trusted, isTrusted);
auto consSpan =
std::make_shared<telemetry::SpanGuard>(telemetry::validationReceiveSpan(*m));
consSpan->setAttribute(telemetry::consensus::span::attr::trusted, isTrusted);
if (val->isFieldPresent(sfLedgerSequence))
{
span->setAttribute(
consSpan->setAttribute(
telemetry::consensus::span::attr::ledgerSeq,
static_cast<int64_t>(val->getFieldU32(sfLedgerSequence)));
}
span->setAttribute(telemetry::consensus::span::attr::fullValidation, val->isFull());
span->setAttribute(
consSpan->setAttribute(telemetry::consensus::span::attr::fullValidation, val->isFull());
consSpan->setAttribute(
telemetry::consensus::span::attr::validationSignTime,
static_cast<int64_t>(val->getSignTime().time_since_epoch().count()));
@@ -2482,7 +2494,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
app_.getJobQueue().addJob(
isTrusted ? JtValidationT : JtValidationUt,
name,
[weak, val, m, key, sp = std::move(span)]() {
[weak, val, m, key, sp = std::move(consSpan)]() {
if (auto peer = weak.lock())
peer->checkValidation(val, key, m);
});

@@ -0,0 +1,38 @@
#pragma once
/** Compile-time span name constants for peer overlay tracing.
*
* Used by PeerImp for peer message handling spans (proposals,
* validations). Built on StaticStr/join() from SpanNames.h.
*
* Span hierarchy:
*
* peer.proposal.receive (PeerImp — incoming proposal)
* peer.validation.receive (PeerImp — incoming validation)
*/
#include <xrpl/telemetry/SpanNames.h>
namespace xrpl::telemetry::peer_span {
// ===== Span operation suffixes ===============================================
namespace op {
inline constexpr auto proposalReceive = makeStr("proposal.receive");
inline constexpr auto validationReceive = makeStr("validation.receive");
} // namespace op
// ===== Attribute keys ========================================================
namespace attr {
/// Canonical shared constants (defined in SpanNames.h).
using ::xrpl::telemetry::attr::ledgerHash;
using ::xrpl::telemetry::attr::peerId;
/// Domain-owned bare attrs.
inline constexpr auto proposalTrusted = makeStr("proposal_trusted");
inline constexpr auto validationFull = makeStr("validation_full");
inline constexpr auto validationTrusted = makeStr("validation_trusted");
} // namespace attr
} // namespace xrpl::telemetry::peer_span