Commit Graph

10033 Commits

Author SHA1 Message Date
Pratik Mankawde
56090b0ead merge: pratik/otel-phase5-docs-deployment fix(SpanKind) into pratik/otel-phase6-statsd 2026-05-14 15:55:03 +01:00
Pratik Mankawde
0b4b3c7bf2 merge: pratik/otel-phase3-tx-tracing fix(SpanKind) into pratik/otel-phase4-consensus-tracing 2026-05-14 15:54:59 +01:00
Pratik Mankawde
3e894f8e93 merge: pratik/otel-phase2-rpc-tracing fix(SpanKind) into pratik/otel-phase3-tx-tracing 2026-05-14 15:54:57 +01:00
Pratik Mankawde
cb7dc5c52e merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing 2026-05-14 15:54:55 +01:00
Pratik Mankawde
9cfb43d8d0 merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration 2026-05-14 15:54:53 +01:00
Pratik Mankawde
7ada57e2a8 fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span()
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.

Add categoryToSpanKind() mapping:
  - Rpc           -> kServer   (inbound synchronous request)
  - Peer          -> kConsumer (inbound async peer message)
  - Transactions  -> kInternal
  - Consensus     -> kInternal
  - Ledger        -> kInternal

Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
2026-05-14 15:53:59 +01:00
Pratik Mankawde
fe7cb33b65 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:53:47 +01:00
Pratik Mankawde
ea30adeb47 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:53:43 +01:00
Pratik Mankawde
9bc2e4abb3 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:53:32 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
e60efd4d2f Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:10:46 +01:00
Pratik Mankawde
46d1012ad4 refactor(telemetry): simplify consensus attr naming on phase-4 — drop xrpl.consensus. prefix
- Add canonical shared bare attrs to SpanNames.h: closeTime,
  closeTimeCorrect, closeResolutionMs (reused by ledger domain).
- Keep qualified (rule 5): ledgerId, mode, round, roundId.
- Domain-qualify collisions: state -> consensus_state,
  result -> consensus_result.
- Reuse canonical ledgerSeq from phase-3.
- Drop xrpl.consensus.* prefix from 20+ attrs (proposers, round_time_ms,
  converge_percent, avalanche_threshold, etc.).
- Dispute attrs: bare names (dispute_our_vote, dispute_yays, etc.).
- Update call sites in RCLConsensus.cpp, Consensus.h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:04:16 +01:00
Pratik Mankawde
7eeddd3ad9 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:01:13 +01:00
Pratik Mankawde
e339ba1f6b refactor(telemetry): simplify tx/txq attr naming on phase-3 — drop xrpl.<domain>. prefix
- Add canonical shared attrs to SpanNames.h: txHash (xrpl.tx.hash),
  peerId (xrpl.peer.id), ledgerSeq (xrpl.ledger.seq).
- Drop xrpl.tx.* prefix: local, path, suppressed, peer_version.
- Domain-qualify: status -> tx_status, txq status -> txq_status.
- TxQ: tx_hash -> reuse canonical txHash, ledger_seq -> reuse canonical
  ledgerSeq; bare names for fee_level_paid, required_fee_level, etc.
- Update call sites in PeerImp.cpp, NetworkOPs.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:01:00 +01:00
Pratik Mankawde
ac1b01b4c7 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 15:57:45 +01:00
Pratik Mankawde
497dd007d9 refactor(telemetry): simplify attr naming on phase-2 — drop xrpl.pathfind. prefix
- Drop xrpl.pathfind.* prefix from per-span attrs (source_account,
  dest_account, fast, search_level, num_complete_paths, num_paths,
  num_requests).
- Keep xrpl.pathfind.ledger_index qualified (rule 5: distinct from
  xrpl.ledger.seq).
- Remove per-span nodeAmendmentBlocked/nodeServerState calls from
  RPCHandler — promoted to resource-level attrs.
- Mark node-health attrs in SpanNames.h as RESOURCE-ONLY with doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:57:36 +01:00
Pratik Mankawde
0d845149ec Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 15:55:39 +01:00
Pratik Mankawde
7a854ccad2 refactor(telemetry): simplify attr naming on phase-1c — drop xrpl.<domain>. prefix
- Drop xrpl.rpc.* prefix from per-span attrs (command, version).
- Qualify collision-prone fields: role -> rpc_role/grpc_role,
  status -> rpc_status/grpc_status.
- Rename payload_size -> request_payload_size for cross-domain clarity.
- Simplify link.type -> link_type (bare name, no join).
- Update convention doc in SpanNames.h to reflect new naming rules.
- Update telemetry.md doc with renamed attr keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:13 +01:00
Pratik Mankawde
580ee5ede7 fix(telemetry): StatsD gauge and io_latency first-sample emit
Two fixes so gauges register in Prometheus (via StatsD) even when their
initial/steady-state value is 0:

1. StatsDGaugeImpl m_dirty: default-init to true so the initial value
   (0) is emitted on the first flush. Previously, gauges whose value
   never changed from 0 were never flushed and never appeared
   downstream.

2. io_latency_sampler firstSample_: new atomic<bool>, init true.
   m_event.notify now fires when either firstSample_ is true (exchanged
   to false) or lastSample >= 10 ms. This guarantees the io_latency
   metric is registered on startup; subsequent sub-10 ms samples are
   still suppressed to avoid flooding.
2026-05-13 14:40:58 +01:00
Pratik Mankawde
937d11d7c3 fix(telemetry): default tx span attrs on receive path
Set defaults for tx_span::attr::suppressed (false) and
tx_span::attr::status ("new") immediately after creating the txReceive
span. Without defaults, spans whose suppressed/status attributes would
only be set in the HashRouter-suppressed branch lacked these attributes
entirely, producing incomplete span data in downstream stores.

The suppressed branch still overrides these when the transaction has
already been seen via HashRouter.
2026-05-13 14:40:57 +01:00
Pratik Mankawde
beaf01ae4d fix(telemetry): fix CI failures in phase-6 build, clang-tidy, and rename checks
Build fixes in PeerImp.cpp:
- Rename duplicate `span` variable to `consSpan` in proposal and
  validation handlers to avoid redefinition error
- Fix `->` on non-pointer SpanGuard (now correctly on shared_ptr)
- Fix move-only type copy in lambda capture

Clang-tidy fixes:
- Concatenate nested namespaces in LedgerSpanNames.h and PeerSpanNames.h
- Add missing SpanNames.h includes in BuildLedger.cpp, LedgerMaster.cpp,
  PeerImp.cpp for direct seg:: symbol usage
- Add missing <chrono> and <cstdint> includes in BuildLedger.cpp
- Remove unused Feature.h include from BuildLedger.cpp

Rename check fix:
- Run docs.sh to rename rippled_ metric prefixes to xrpld_ in
  09-data-collection-reference.md and telemetry-runbook.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:09:17 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
c8674d61b8 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
# Conflicts:
#	src/xrpld/app/consensus/RCLConsensus.cpp
2026-04-29 19:58:45 +01:00
Pratik Mankawde
cf18032e7f fix(telemetry): address clang-tidy errors on phase3 transaction tracing files
- Add [[maybe_unused]] to RAII span variables in TxQ.cpp
- Remove unused st.h include, add missing to_string header in TxQ.cpp
- Concatenate nested namespaces in TxQSpanNames.h, TxSpanNames.h,
  ConsensusReceiveTracing.h, PropagationHelpers.h, TxTracing.h
- Remove unused TraceContextPropagator.h include from RCLConsensus.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:57:42 +01:00
Pratik Mankawde
f18ddd95c1 fix(telemetry): address clang-tidy errors on phase4 consensus tracing files
- Add [[nodiscard]] to getConsensusTraceStrategy, getYays, getNays
- Add missing <string>, SpanGuard.h, SpanNames.h includes
- Fix widening cast placement (cast before arithmetic, not after)
- Replace nested ternary with lambda for const dir variable
- Add braces to if/else-if chains in Consensus.h
- Concatenate nested namespaces in ConsensusSpanNames.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:54:23 +01:00
Pratik Mankawde
e6266e4e8d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 18:20:23 +01:00
Pratik Mankawde
692ce65f3e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 18:20:04 +01:00
Pratik Mankawde
bb3732c22f Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 18:19:58 +01:00
Pratik Mankawde
87a8780b73 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
# Conflicts:
#	src/xrpld/rpc/detail/RPCHandler.cpp
2026-04-29 18:18:37 +01:00
Pratik Mankawde
78522ba18e Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 18:17:02 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
e21e7b0d51 fix(telemetry): add missing PublicKey.h include for toBase58 in Application.cpp
Clang-tidy misc-include-cleaner requires direct includes for all used
symbols. Application.cpp calls toBase58(TokenType::NodePublic, ...) at
line 1359 but did not directly include PublicKey.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:51:35 +01:00
Pratik Mankawde
3dd2f34591 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-04-29 17:38:03 +01:00
Pratik Mankawde
dbcd040180 fix(telemetry): fix Clang unused-variable and incomplete-type errors
- Add [[maybe_unused]] to RAII spans in TxQ.cpp
- Include Telemetry.h in RCLConsensus.cpp for complete type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
17e69e660c feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ef10c754b1 fix(telemetry): address code review findings for Phase 4 consensus tracing
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
912890c104 fix: address PR review round 2 — event name constants, span timing
- Add cons_span::event namespace with disputeResolve and txIncluded
  constants; replace hardcoded strings in Consensus.h and RCLConsensus.cpp
- Move proposal.receive and validation.receive spans in PeerImp into
  shared_ptr captured by job lambdas so they measure checkPropose and
  checkValidation timing, not just message parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ac68091bec code review changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
5f7de1bb48 feat(telemetry): complete Phase 4 consensus tracing
Implement remaining Phase 4/4a consensus tracing tasks:

- Add consensus.phase.open span (open → closeLedger lifecycle)
- Add consensus.proposal.receive span in PeerImp with trusted attr
- Add consensus.validation.receive span in PeerImp with trusted/seq attrs
- Add tx_count attr on accept.apply, disputes_count on update_positions
- Add tx.included events with txId in doAccept transaction loop
- Enhance dispute.resolve event with yays/nays fields
- Add avalanche_threshold attr on update_positions span
- Reparent accept/accept.apply as children of round span via childSpan()

Also adds compile-time constants in ConsensusSpanNames.h and updates
the span hierarchy diagram.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
eb84ac57c7 fix(telemetry): remove duplicate hashSpan(4-arg) from rebase
The 4-arg hashSpan overload was duplicated during a prior rebase
cascade — it appeared at both line 240 and line 305 in SpanGuard.cpp.
This would cause a linker error (multiple definition).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ab6b6d215e feat(telemetry): add avalanche threshold and close time consensus attributes
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:

- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
  threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
  reached close time consensus in this iteration

These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
53d0daf3b4 fix(telemetry): preserve deterministic trace_id in round spans
Remove the span-replacement logic in startRoundTracing() that was
discarding the hash-derived round span and replacing it with a linked
span (which gets a random trace_id). The deterministic trace_id from
the ledger hash is the key feature enabling cross-node correlation —
replacing it broke correlation on all rounds after the first.

Also: use thread_local mt19937 for hashSpan() span IDs (same fix as
phase-3 txSpan), add Doxygen to establish tracing method declarations
in Consensus.h, and update SpanGuard.h diagram with hashSpan/addEvent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
5cbb349efa fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
93bed03d8d fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
581ab8f552 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
30af98200f fix(telemetry): use default_prng() for span IDs, fix non-telemetry build
Replace thread_local mt19937 with xrpl::default_prng() for span ID
generation — uses the project's existing thread-local xor-shift engine.
One call yields a uint64_t (8 bytes), filling the span ID in a single
memcpy without loops.

Fix compilation failure when XRPL_ENABLE_TELEMETRY is not defined:
move xrpl.pb.h include outside the #ifdef guard in TxTracing.h since
protocol::TMTransaction is used unconditionally in the function
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ff27e62e1f fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00