Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.
Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards
Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark Tasks 5.3 (alert definitions) and 5.6 (training materials) as
"Deferred — post-MVP" in the implementation phases document to
accurately reflect current delivery scope. Add status column to the
Phase 5 task table.
Also fix stale reference to XRPL_TRACE_* macros in Phase 4a section —
the implementation uses SpanGuard factory methods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:
- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
reached close time consensus in this iteration
These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move node health attribute strings to compile-time constants in
SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark deferred tasks (2.1→Phase 3, 2.5→low priority) with rationale.
Mark superseded tasks (2.2→Phase 1c SpanGuard factory). Add Task 2.7
for Grafana search filters. Update summary table with status column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace references to non-existent TracingInstrumentation.h with
SpanGuard.cpp pimpl implementation that actually exists on this branch.
Update conditional compilation section to describe the pimpl approach.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redesign SpanGuard with pimpl idiom to hide all OpenTelemetry types
from public headers. Add global Telemetry accessor so SpanGuard factory
methods work without explicit Telemetry references. Add child/linked
span creation and cross-thread context propagation. Update plan docs
to reflect macro removal in favor of SpanGuard factory pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add DiscardFlag.h and FilteringSpanProcessor references to the file
tree, key files table, and implementation summary in OpenTelemetryPlan.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove duplicate otlp/tempo exporter block, duplicate tempo service
definition, and jaeger dependency from docker-compose example.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in all plan documentation files, fixing the check-rename
CI failure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plan documents referenced Application.h and app_ for getTelemetry()
but the codebase now uses ServiceRegistry as the interface. Updated:
- 05-configuration-reference.md: getTelemetry() on ServiceRegistry,
deferred serviceInstanceId pattern in ApplicationImp
- POC_taskList.md Task 4: target ServiceRegistry.h not Application.h,
correct config file path and constructor pattern
- 04-code-samples.md: fix overlay() -> getOverlay(), rewrite JobQueue
sample to reflect actual architecture (no app_ member)
- 03-implementation-strategy.md: fix file impact table path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip effort/risk columns from task tables and remove the §6.9 Effort
Summary section with its pie chart and resource requirements table.
Renumber §6.10 Quick Wins → §6.9.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split document index into Plan Documents and Task Lists sections.
These files were introduced in this branch but missing from the index.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>