Commit Graph

14143 Commits

Author SHA1 Message Date
Pratik Mankawde
beaf01ae4d fix(telemetry): fix CI failures in phase-6 build, clang-tidy, and rename checks
Build fixes in PeerImp.cpp:
- Rename duplicate `span` variable to `consSpan` in proposal and
  validation handlers to avoid redefinition error
- Fix `->` on non-pointer SpanGuard (now correctly on shared_ptr)
- Fix move-only type copy in lambda capture

Clang-tidy fixes:
- Concatenate nested namespaces in LedgerSpanNames.h and PeerSpanNames.h
- Add missing SpanNames.h includes in BuildLedger.cpp, LedgerMaster.cpp,
  PeerImp.cpp for direct seg:: symbol usage
- Add missing <chrono> and <cstdint> includes in BuildLedger.cpp
- Remove unused Feature.h include from BuildLedger.cpp

Rename check fix:
- Run docs.sh to rename rippled_ metric prefixes to xrpld_ in
  09-data-collection-reference.md and telemetry-runbook.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:09:17 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
51918ef868 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 19:58:54 +01:00
Pratik Mankawde
c8674d61b8 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
# Conflicts:
#	src/xrpld/app/consensus/RCLConsensus.cpp
2026-04-29 19:58:45 +01:00
Pratik Mankawde
cf18032e7f fix(telemetry): address clang-tidy errors on phase3 transaction tracing files
- Add [[maybe_unused]] to RAII span variables in TxQ.cpp
- Remove unused st.h include, add missing to_string header in TxQ.cpp
- Concatenate nested namespaces in TxQSpanNames.h, TxSpanNames.h,
  ConsensusReceiveTracing.h, PropagationHelpers.h, TxTracing.h
- Remove unused TraceContextPropagator.h include from RCLConsensus.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:57:42 +01:00
Pratik Mankawde
f18ddd95c1 fix(telemetry): address clang-tidy errors on phase4 consensus tracing files
- Add [[nodiscard]] to getConsensusTraceStrategy, getYays, getNays
- Add missing <string>, SpanGuard.h, SpanNames.h includes
- Fix widening cast placement (cast before arithmetic, not after)
- Replace nested ternary with lambda for const dir variable
- Add braces to if/else-if chains in Consensus.h
- Concatenate nested namespaces in ConsensusSpanNames.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:54:23 +01:00
Pratik Mankawde
e6266e4e8d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 18:20:23 +01:00
Pratik Mankawde
025620cc4e Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 18:20:19 +01:00
Pratik Mankawde
692ce65f3e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 18:20:04 +01:00
Pratik Mankawde
bb3732c22f Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 18:19:58 +01:00
Pratik Mankawde
87a8780b73 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
# Conflicts:
#	src/xrpld/rpc/detail/RPCHandler.cpp
2026-04-29 18:18:37 +01:00
Pratik Mankawde
78522ba18e Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 18:17:02 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
e21e7b0d51 fix(telemetry): add missing PublicKey.h include for toBase58 in Application.cpp
Clang-tidy misc-include-cleaner requires direct includes for all used
symbols. Application.cpp calls toBase58(TokenType::NodePublic, ...) at
line 1359 but did not directly include PublicKey.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:51:35 +01:00
Pratik Mankawde
3dd2f34591 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-04-29 17:38:03 +01:00
Pratik Mankawde
521e0756e1 docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:33:10 +01:00
Pratik Mankawde
dbcd040180 fix(telemetry): fix Clang unused-variable and incomplete-type errors
- Add [[maybe_unused]] to RAII spans in TxQ.cpp
- Include Telemetry.h in RCLConsensus.cpp for complete type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
17e69e660c feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ef10c754b1 fix(telemetry): address code review findings for Phase 4 consensus tracing
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
912890c104 fix: address PR review round 2 — event name constants, span timing
- Add cons_span::event namespace with disputeResolve and txIncluded
  constants; replace hardcoded strings in Consensus.h and RCLConsensus.cpp
- Move proposal.receive and validation.receive spans in PeerImp into
  shared_ptr captured by job lambdas so they measure checkPropose and
  checkValidation timing, not just message parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ac68091bec code review changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
2773de7b54 docs(telemetry): mark Phase 4/4a consensus tracing tasks complete
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
5f7de1bb48 feat(telemetry): complete Phase 4 consensus tracing
Implement remaining Phase 4/4a consensus tracing tasks:

- Add consensus.phase.open span (open → closeLedger lifecycle)
- Add consensus.proposal.receive span in PeerImp with trusted attr
- Add consensus.validation.receive span in PeerImp with trusted/seq attrs
- Add tx_count attr on accept.apply, disputes_count on update_positions
- Add tx.included events with txId in doAccept transaction loop
- Enhance dispute.resolve event with yays/nays fields
- Add avalanche_threshold attr on update_positions span
- Reparent accept/accept.apply as children of round span via childSpan()

Also adds compile-time constants in ConsensusSpanNames.h and updates
the span hierarchy diagram.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
264516c37d docs update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
eb84ac57c7 fix(telemetry): remove duplicate hashSpan(4-arg) from rebase
The 4-arg hashSpan overload was duplicated during a prior rebase
cascade — it appeared at both line 240 and line 305 in SpanGuard.cpp.
This would cause a linker error (multiple definition).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
021c81e978 docs(telemetry): document hashSpan factory, ConsensusSpanNames.h, and API details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ab6b6d215e feat(telemetry): add avalanche threshold and close time consensus attributes
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:

- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
  threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
  reached close time consensus in this iteration

These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
53d0daf3b4 fix(telemetry): preserve deterministic trace_id in round spans
Remove the span-replacement logic in startRoundTracing() that was
discarding the hash-derived round span and replacing it with a linked
span (which gets a random trace_id). The deterministic trace_id from
the ledger hash is the key feature enabling cross-node correlation —
replacing it broke correlation on all rounds after the first.

Also: use thread_local mt19937 for hashSpan() span IDs (same fix as
phase-3 txSpan), add Doxygen to establish tracing method declarations
in Consensus.h, and update SpanGuard.h diagram with hashSpan/addEvent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
5cbb349efa fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
93bed03d8d fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
581ab8f552 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
6154357daa fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3a1e462bef docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
30af98200f fix(telemetry): use default_prng() for span IDs, fix non-telemetry build
Replace thread_local mt19937 with xrpl::default_prng() for span ID
generation — uses the project's existing thread-local xor-shift engine.
One call yields a uint64_t (8 bytes), filling the span ID in a single
memcpy without loops.

Fix compilation failure when XRPL_ENABLE_TELEMETRY is not defined:
move xrpl.pb.h include outside the #ifdef guard in TxTracing.h since
protocol::TMTransaction is used unconditionally in the function
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ff27e62e1f fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
7e93e75d8e refactor(telemetry): colocate SpanNames headers with their classes
Move TxSpanNames.h and TxQSpanNames.h from src/xrpld/telemetry/ to sit
next to the classes they instrument, matching the PathFindSpanNames.h
convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ecd02134fa feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3c0eec0209 docs(telemetry): add Task 3.10 TxQ instrumentation to Phase 3 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
7b9e2cf91f feat(telemetry): add TxQ tracing with 6 spans (Tasks 3.9/3.10)
Instrument the transaction queue lifecycle with full span coverage:

- txq.enqueue: wraps TxQ::apply() enqueue/direct/reject decision
  with tx_hash attribute
- txq.apply_direct: wraps TxQ::tryDirectApply() fast-path
- txq.batch_clear: wraps TxQ::tryClearAccountQueueUpThruTx()
  batch clear on high-fee tx
- txq.accept: wraps TxQ::accept() ledger-close dequeue cycle
  with queue_size attribute
- txq.accept_tx: per-tx span inside accept loop with tx_hash,
  ter_code, retries_remaining attributes
- txq.cleanup: wraps TxQ::processClosedLedger() fee metric updates
  and tx expiration with ledger_seq attribute

New file: TxQSpanNames.h with compile-time constants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
312dec2baa docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
be812b8d21 refactor(telemetry): extract TX span name constants into TxSpanNames.h
Move scattered string literals from PeerImp.cpp and NetworkOPs.cpp into
compile-time constants in src/xrpld/telemetry/TxSpanNames.h. Follows
the same StaticStr/join() pattern established in Phase 1c for RPC spans.

Constants cover: span prefixes (tx), operations (receive, process),
attribute keys (hash, local, path, suppressed, status, peerId,
peerVersion), and values (sync, async, knownBad).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
e63a5f72be docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
1e2287e6e1 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3508917f17 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3ed22580fe fix(telemetry): address remaining clang-tidy and cspell CI failures
- Add "hicpp" to cspell dictionary for NOLINT annotations
- Concatenate nested namespaces in RpcSpanNames.h
- Fix include hygiene and nested ternary in RPCHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:31:58 +01:00
Pratik Mankawde
f434706eec Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
2026-04-29 17:16:28 +01:00
Pratik Mankawde
8a54ef1600 docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:08:53 +01:00
Pratik Mankawde
612a32d047 feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:07:10 +01:00