Phase 4 added a span catalog in `06-implementation-phases.md` listing the
source location for each consensus span. Line numbers `Consensus.h:707`,
`RCLConsensus.cpp:232/341/492/541/900` drift on every refactor and would
become stale PR after PR. Filename alone is enough for operators to
grep — the RCLConsensus.cpp spans are already unambiguous from the span
name itself.
Phase-6 introduces ledger-operations, peer-network, and the five StatsD
dashboards. Align them with the rest of the chain:
- Rename dashboard UIDs from `rippled-*` to `xrpld-*` so the provisioned
UIDs match the post-rename-script documentation (`docs.sh` rewrites
.md but not .json, so the two drifted). Runbook references
`xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches.
- Add the `$node` template variable + `exported_instance=~"$node"` filter
to every target in the five `statsd-*` dashboards. Mirrors the pattern
already used by consensus-health, ledger-operations, and peer-network
per the project rule that every dashboard must support per-node
filtering.
- Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references
in every dashboard panel description and in docker/telemetry/TESTING.md.
Line numbers drift on every refactor; the filename alone is enough to
grep.
- Replace stale `rpc.request` entries with the real emitted span names
(`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
in TESTING.md so operators can copy-paste the filters and hit real
traces.
- Also drop the `:706` line ref from the `StatsDCollector.cpp` callout
in `06-implementation-phases.md`.
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.
Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
"OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
and the :4317 default from the sample config block and the options-summary
table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
into a new "Planned (not yet implemented)" table citing the phase that will
add each one. Keep the example config minimal so copy-paste does not produce
a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
[x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
Complete" to "Phase 4a Plan" and describe the work as future. Replace the
broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
with a sentence pointing readers to where that spec will land. Renumber
the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
("Effort Summary") was intentionally removed in earlier edits.
- Replace GetSpan() with direct context value check in Logs::format()
to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
detect missing log files, fix silent grep error masking
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace remaining rippled/Ripple references with xrpld/XRPL in
data collection reference, implementation phases, and runbook docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.
Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards
Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark Tasks 5.3 (alert definitions) and 5.6 (training materials) as
"Deferred — post-MVP" in the implementation phases document to
accurately reflect current delivery scope. Add status column to the
Phase 5 task table.
Also fix stale reference to XRPL_TRACE_* macros in Phase 4a section —
the implementation uses SpanGuard factory methods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip effort/risk columns from task tables and remove the §6.9 Effort
Summary section with its pie chart and resource requirements table.
Renumber §6.10 Quick Wins → §6.9.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip effort/risk columns from task tables and remove the §6.9 Effort
Summary section with its pie chart and resource requirements table.
Renumber §6.10 Quick Wins → §6.9.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>