Commit Graph

14013 Commits

Author SHA1 Message Date
Pratik Mankawde
fe058d49b4 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 11:21:35 +01:00
Pratik Mankawde
bd6e58a20e fix(telemetry): add missing span constants, fix test API, update levelization
Add unknownCommand and wsUpgrade span name constants to RpcSpanNames.h,
fix SpanGuardFactory tests to use the 3-argument SpanGuard::span() API,
update levelization results, and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:21:13 +01:00
Pratik Mankawde
e8c826c816 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-29 11:17:19 +01:00
Pratik Mankawde
d753191d20 Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 11:16:51 +01:00
Pratik Mankawde
d4e91b462e fix(telemetry): resolve clang-tidy warnings in Telemetry interfaces
Use C++17 concatenated namespaces, add [[nodiscard]] to query methods,
add missing direct includes, and use pass-by-value + std::move in
NullTelemetry constructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:16:21 +01:00
Pratik Mankawde
fb25d97077 fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 18:01:50 +01:00
Pratik Mankawde
ebd84a2338 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
# Conflicts:
#	src/libxrpl/telemetry/SpanGuard.cpp
2026-04-28 15:07:54 +01:00
Pratik Mankawde
fa2736277f Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-28 15:07:17 +01:00
Pratik Mankawde
196c309d1d Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
# Conflicts:
#	src/libxrpl/telemetry/Telemetry.cpp
2026-04-28 15:07:07 +01:00
Pratik Mankawde
d46d015fd5 fix(telemetry): fix include ordering in PathFind span files
Sort PathFindSpanNames.h after AssetCache.h alphabetically in
PathRequestManager.cpp and Pathfinder.cpp to satisfy the project's
include-order convention enforced by pre-commit hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:05:57 +01:00
Pratik Mankawde
999bf83f15 fix(telemetry): fix SpanGuard.cpp include ordering
Move SpanGuard.h (associated header) to first include position,
separated by blank line from other project includes, per the
project's include-order convention enforced by pre-commit hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:05:03 +01:00
Pratik Mankawde
96470e0c8d fix(telemetry): fix include ordering and markdown table formatting
Move Telemetry.h (associated header) to first include position in
Telemetry.cpp per the project's include-order convention. Trim
trailing whitespace from POC_taskList.md markdown table columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:04:09 +01:00
Pratik Mankawde
4f4b4dd199 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:30:31 +01:00
Pratik Mankawde
d87839230a fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
e2cb811bf7 docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
2bb0995ff8 fix(telemetry): use default_prng() for span IDs, fix non-telemetry build
Replace thread_local mt19937 with xrpl::default_prng() for span ID
generation — uses the project's existing thread-local xor-shift engine.
One call yields a uint64_t (8 bytes), filling the span ID in a single
memcpy without loops.

Fix compilation failure when XRPL_ENABLE_TELEMETRY is not defined:
move xrpl.pb.h include outside the #ifdef guard in TxTracing.h since
protocol::TMTransaction is used unconditionally in the function
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
793fe65a96 fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
737b0f5488 refactor(telemetry): colocate SpanNames headers with their classes
Move TxSpanNames.h and TxQSpanNames.h from src/xrpld/telemetry/ to sit
next to the classes they instrument, matching the PathFindSpanNames.h
convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
ded848075d feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
397c66cede docs(telemetry): add Task 3.10 TxQ instrumentation to Phase 3 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
Pratik Mankawde
2fb165cd54 feat(telemetry): add TxQ tracing with 6 spans (Tasks 3.9/3.10)
Instrument the transaction queue lifecycle with full span coverage:

- txq.enqueue: wraps TxQ::apply() enqueue/direct/reject decision
  with tx_hash attribute
- txq.apply_direct: wraps TxQ::tryDirectApply() fast-path
- txq.batch_clear: wraps TxQ::tryClearAccountQueueUpThruTx()
  batch clear on high-fee tx
- txq.accept: wraps TxQ::accept() ledger-close dequeue cycle
  with queue_size attribute
- txq.accept_tx: per-tx span inside accept loop with tx_hash,
  ter_code, retries_remaining attributes
- txq.cleanup: wraps TxQ::processClosedLedger() fee metric updates
  and tx expiration with ledger_seq attribute

New file: TxQSpanNames.h with compile-time constants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
Pratik Mankawde
c585d9b66c docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:16 +01:00
Pratik Mankawde
79ed703bb2 refactor(telemetry): extract TX span name constants into TxSpanNames.h
Move scattered string literals from PeerImp.cpp and NetworkOPs.cpp into
compile-time constants in src/xrpld/telemetry/TxSpanNames.h. Follows
the same StaticStr/join() pattern established in Phase 1c for RPC spans.

Constants cover: span prefixes (tx), operations (receive, process),
attribute keys (hash, local, path, suppressed, status, peerId,
peerVersion), and values (sync, async, knownBad).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
441c88dfb1 docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
178bc916a8 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
19eead6955 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
ed8164d502 docs(telemetry): add Task 2.9 PathFind instrumentation to Phase 2 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
682d7a8d76 feat(telemetry): add PathFind tracing with 5 spans (Tasks 2.9/2.10)
Instrument the path finding subsystem with full span coverage:

- pathfind.request: wraps doPathFind() and doRipplePathFind() RPC handlers
- pathfind.compute: wraps PathRequest::doUpdate() with fast/normal attr
- pathfind.update_all: wraps PathRequestManager::updateAll() on ledger
  close with ledger_index attr
- pathfind.discover: wraps Pathfinder::findPaths() graph exploration
  with search_level attr
- pathfind.rank: wraps Pathfinder::computePathRanks() liquidity
  validation with num_paths attr

New file: PathFindSpanNames.h with compile-time constants following
the StaticStr/join() pattern from Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
eb51457e69 fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
65817c4c57 fix(telemetry): align TelemetryConfig tests with current API
- serviceName default is "xrpld" not "rippled"
- Remove references to nonexistent exporterType field
- Pass networkId (4th param) to setup_Telemetry()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
9bc8cc6b4e docs(telemetry): update Phase2 task list to reflect actual implementation
Mark deferred tasks (2.1→Phase 3, 2.5→low priority) with rationale.
Mark superseded tasks (2.2→Phase 1c SpanGuard factory). Add Task 2.7
for Grafana search filters. Update summary table with status column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
832648c351 feat(telemetry): add RPC trace filters and SpanGuard unit tests
- Grafana Tempo datasource: add rpc-command, rpc-status, rpc-role
  search filters for the Explore UI
- Unit tests: TelemetryConfig (config parsing defaults and sections),
  SpanGuardFactory (null guard safety, move semantics, discard, all
  factory methods)
- Test CMake registration with optional OTel linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
21b58a8885 feat(telemetry): add node health attributes to RPC spans (Task 2.8)
Add amendment_blocked and server_state span attributes to every
rpc.command.* span so operators can correlate RPC behavior with node state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
a9ee819ea1 docs(telemetry): add Phase 2-5 task lists and appendix update
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
736579e473 refactor(telemetry): extract span name constants into modular headers
Centralise scattered string literals into compile-time constants using
StaticStr<N> and join() for dot-separated composition. Shared primitives
live in SpanNames.h; RPC-specific names in RpcSpanNames.h. Future modules
(consensus, peer, ledger) add their own *SpanNames.h without bloating
the central header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
3b93e2d4d9 fix(telemetry): suppress unused span warning and regenerate levelization
- Add [[maybe_unused]] to the RAII span in processSession() — the
  variable is not read but its lifetime scopes the active OTel context
  for child spans created in processRequest()
- Regenerate levelization: remove premature xrpld.telemetry entries
  that reference a module not yet present on this branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
ac9bd2c055 fix(telemetry): use span name constants and fix cardinality risk
- Use grpc_span::val::resourceExhausted constant instead of raw
  "resource_exhausted" string in GRPCServer.cpp
- Fix unbounded span name cardinality in RPCHandler.cpp error path:
  use fixed rpc_span::val::unknownCommand as span name instead of
  user-supplied cmdName (attacker-controlled input). The actual
  command is still captured in the xrpl.rpc.command attribute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
4124762343 fix(telemetry): pass name_ through CallData::clone()
Without this, cloned CallData instances (created for the next incoming
gRPC request) would have an empty name_, making subsequent span attrs
blank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
ea8600e204 feat(telemetry): instrument missing critical/medium RPC span paths
Add spans to previously uninstrumented error and validation paths:

- gRPC: span in CallData::process(coro) with method name attribute,
  covers all 4 gRPC endpoints (GetLedger, GetLedgerData, etc.)
- WebSocket parse errors: span in onWSMessage() for invalid JSON
- WebSocket upgrade failures: span in onHandoff() try/catch
- Command dispatch rejections: span in doCommand() when fillHandler()
  fails (unknown command, too busy, permission denied)

New files: GrpcSpanNames.h (gRPC span constants)
Modified: GRPCServer.h (name_ member), RpcSpanNames.h (wsUpgrade op,
updated coverage diagram)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
895e9167b0 docs(telemetry): replace text hierarchy with ASCII box diagrams
Follow project convention (PerfLog.h, SpanGuard.h) for documentation
diagrams. Show HTTP single, HTTP batch, and WebSocket span nesting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
d15d2d2df6 docs(telemetry): add RPC span coverage map to RpcSpanNames.h
Document the span hierarchy, covered paths, and known instrumentation
gaps directly in the header that developers reference when adding spans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
75bcd4ff53 refactor(telemetry): extract span name constants into modular headers
Centralise scattered string literals into compile-time constants using
StaticStr<N> and join() for dot-separated composition. Shared primitives
live in SpanNames.h; RPC-specific names in RpcSpanNames.h. Future modules
(consensus, peer, ledger) add their own *SpanNames.h without bloating
the central header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
a73117ddd0 refactor(telemetry): update RPC call sites to TraceCategory API
Replace rpcSpan(fullName) calls with span(TraceCategory::Rpc, prefix,
name). Add 'using namespace telemetry' to both RPC files so call sites
read cleanly without repeated namespace qualifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
9e4d943c69 feat(telemetry): replace tracing macros with SpanGuard factory pattern
Delete TracingInstrumentation.h and replace all XRPL_TRACE_* macro
invocations with direct SpanGuard::rpcSpan() calls. SpanGuard's pimpl
design and global Telemetry accessor eliminate the need for macro
wrappers and explicit Telemetry instance passing at call sites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
025a8a344b fix(telemetry): address Phase 1c code review findings
TracingInstrumentation.h:
- Unify all span-creation macros to use std::optional<SpanGuard>
  (fixes type mismatch between XRPL_TRACE_SPAN and SET_ATTR)
- Wrap XRPL_TRACE_SET_ATTR/EXCEPTION in do-while(0) (dangling-else)
- Move macros outside namespace blocks (macros are global)
- Cache telemetry reference to avoid double-evaluation
- Remove leaked _xrpl_span_ intermediate variable
- Add @note tags for thread safety, scope, and usage constraints
- Add 3 usage examples per CLAUDE.md requirements

ServerHandler.cpp:
- Remove misleading rpc.request span from onRequest() (span ended
  before coroutine runs, producing orphan spans)
- Add rpc.http_request span to HTTP processSession() (runs inside
  the coroutine, correct parent for rpc.process/rpc.command spans)
- Add XRPL_TRACE_EXCEPTION and error status in both catch blocks
  (WS processSession and processRequest)

SpanGuard.h:
- Add null guards to all mutating methods (setOk, setStatus,
  setAttribute, addEvent, recordException) for safety after discard()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
9ee9e566d4 removed presentation.md from root
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
0de807b1be Phase 1c: RPC integration - ServerHandler tracing, telemetry config wiring
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00
Pratik Mankawde
59ee027d8a fix(telemetry): resolve clang-tidy warnings in SpanGuard.h
- Concatenate nested namespaces (modernize-concat-nested-namespaces)
- Add [[nodiscard]] to factory and accessor methods
- NOLINT no-op stub instance methods that must stay non-static for API
  parity with the real implementation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
7aa4486741 refactor(telemetry): remove unused SpanGuard::span(name) overload
Remove the single-arg span(name) factory that creates unconditional
spans without category gating. All call sites use the 3-arg
span(TraceCategory, prefix, name) variant which checks whether the
category is enabled in config before creating a span. The 1-arg form
was dead code with no callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
5e8277f36a docs(telemetry): fix doc references to match pimpl architecture
Replace references to non-existent TracingInstrumentation.h with
SpanGuard.cpp pimpl implementation that actually exists on this branch.
Update conditional compilation section to describe the pimpl approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00