Commit Graph

334 Commits

Author SHA1 Message Date
Pratik Mankawde
92072ecca4 fix(telemetry): fix CI failures — clang-tidy, levelization, linker
Clang-tidy fixes:
- Concatenate nested namespaces (modernize-concat-nested-namespaces)
  in OTelCollector.h, OTelCollector.cpp, ValidationTracker.h/.cpp
- Add missing direct includes (misc-include-cleaner) in
  ValidationTracker.cpp, test, CollectorManager.cpp, OTelCollector.cpp
- Make lock_guard variables const (misc-const-correctness)
- Add braces around single-line if/else (readability-braces-around-statements)
- Use designated initializer for WindowEvent (modernize-use-designated-initializers)
- Initialize LedgerEvent::seq field (cppcoreguidelines-pro-type-member-init)

Linker fix:
- Add ValidationTracker.cpp as source to xrpl.test.telemetry target
  (it lives in src/xrpld/ but the test links against libxrpl only)

Levelization fix:
- Remove stale dependency edges from ordering.txt that were introduced
  by the erroneous develop-merge commit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:07:25 +01:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
ed31bab500 fix: restore unrelated files to phase-6 state after revert
The blanket revert of f4555c80fe also un-reverted some files that had
been correctly matched to phase-6 (nodestore Backend API refactor,
Vault_test changes). Restore those to the base branch state so the
phase-7 PR only contains telemetry changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:15:14 +01:00
Pratik Mankawde
b62987bda7 Revert "fix: revert all unrelated upstream develop changes from phase-7 PR"
This reverts commit f4555c80fe.
2026-05-06 14:13:59 +01:00
Pratik Mankawde
3f793b2098 fix: revert incorrect docs.sh renames (README.md URL, configLegacyName)
- protocol/README.md: restore historical GitHub URL path (src/ripple/)
- Config.cpp: restore configLegacyName as "rippled.cfg" (legacy name
  must remain as-is for backward compatibility)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:00:03 +01:00
Pratik Mankawde
f4555c80fe fix: revert all unrelated upstream develop changes from phase-7 PR
Reverts 259 files that carried unrelated upstream changes through the
phase-6 merge: enum class removals (cppcoreguidelines-use-enum-class),
scoped_lock→lock_guard conversions (modernize-use-scoped-lock),
nodestore Backend API changes (void const* key), .clang-tidy config,
test infrastructure deletions, and miscellaneous develop changes.

These changes belong on develop, not in the telemetry PR chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:59:24 +01:00
Pratik Mankawde
f44b89b99d fix: revert unrelated develop changes from phase-7 PR
- Revert reusable-build-test-config.yml to develop (action SHA update
  and "Show test failure summary" step removal don't belong here)
- Revert upload-conan-deps.yml to develop (action SHA update)
- Revert features.macro: BatchInnerSigs and Batch back to Supported::no
  (these feature flag changes are unrelated to telemetry)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:43 +01:00
Pratik Mankawde
f183e9b57f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-04-29 21:17:48 +01:00
Pratik Mankawde
b65f91117f fix: address CI checks (prettier, docs.sh rename, levelization)
- Prettier formatting for markdown docs and OTelCollector header
- docs.sh rippled→xrpld renames in OTelCollector.cpp comments/strings
- Updated levelization ordering with new dependency edges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:03:22 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
f18ddd95c1 fix(telemetry): address clang-tidy errors on phase4 consensus tracing files
- Add [[nodiscard]] to getConsensusTraceStrategy, getYays, getNays
- Add missing <string>, SpanGuard.h, SpanNames.h includes
- Fix widening cast placement (cast before arithmetic, not after)
- Replace nested ternary with lambda for const dir variable
- Add braces to if/else-if chains in Consensus.h
- Concatenate nested namespaces in ConsensusSpanNames.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:54:23 +01:00
Pratik Mankawde
769668579a Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	.codecov.yml
#	.github/scripts/levelization/results/ordering.txt
#	.github/workflows/reusable-clang-tidy-files.yml
#	CMakeLists.txt
#	OpenTelemetryPlan/00-tracing-fundamentals.md
#	OpenTelemetryPlan/01-architecture-analysis.md
#	OpenTelemetryPlan/02-design-decisions.md
#	OpenTelemetryPlan/03-implementation-strategy.md
#	OpenTelemetryPlan/04-code-samples.md
#	OpenTelemetryPlan/05-configuration-reference.md
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/07-observability-backends.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	OpenTelemetryPlan/POC_taskList.md
#	OpenTelemetryPlan/Phase2_taskList.md
#	OpenTelemetryPlan/Phase3_taskList.md
#	OpenTelemetryPlan/Phase4_taskList.md
#	OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
#	OpenTelemetryPlan/Phase5_taskList.md
#	OpenTelemetryPlan/presentation.md
#	cfg/xrpld-example.cfg
#	conan.lock
#	conanfile.py
#	cspell.config.yaml
#	docker/telemetry/TESTING.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docker/telemetry/integration-test.sh
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/tempo.yaml
#	docker/telemetry/xrpld-telemetry.cfg
#	docs/build/telemetry.md
#	docs/telemetry-runbook.md
#	include/xrpl/core/ServiceRegistry.h
#	include/xrpl/protocol/detail/features.macro
#	include/xrpl/telemetry/SpanGuard.h
#	include/xrpl/telemetry/Telemetry.h
#	include/xrpl/telemetry/TraceContextPropagator.h
#	src/libxrpl/basics/MallocTrim.cpp
#	src/libxrpl/nodestore/backend/MemoryFactory.cpp
#	src/libxrpl/nodestore/backend/NuDBFactory.cpp
#	src/libxrpl/nodestore/backend/RocksDBFactory.cpp
#	src/libxrpl/telemetry/NullTelemetry.cpp
#	src/libxrpl/telemetry/Telemetry.cpp
#	src/libxrpl/telemetry/TelemetryConfig.cpp
#	src/tests/libxrpl/basics/MallocTrim.cpp
#	src/tests/libxrpl/telemetry/TelemetryConfig.cpp
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/consensus/RCLConsensus.h
#	src/xrpld/app/ledger/detail/BuildLedger.cpp
#	src/xrpld/app/ledger/detail/LedgerMaster.cpp
#	src/xrpld/app/main/Application.cpp
#	src/xrpld/app/misc/NetworkOPs.cpp
#	src/xrpld/consensus/Consensus.h
#	src/xrpld/overlay/detail/PeerImp.cpp
#	src/xrpld/rpc/detail/RPCHandler.cpp
#	src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
581ab8f552 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
6154357daa fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ff27e62e1f fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ecd02134fa feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3508917f17 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
f434706eec Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
2026-04-29 17:16:28 +01:00
Pratik Mankawde
54c97daaf1 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
654fe2d30f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
a05ada89ec refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
8afe604aff fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:36 +01:00
Pratik Mankawde
6a8053df2d fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
7f0a8a7ed7 feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
780cc434a7 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00
Pratik Mankawde
20fabbc0ec fix(telemetry): resolve Clang build, clang-tidy, and rename CI failures
- Add [[maybe_unused]] to RAII span variables in PathFind/RipplePathFind
  handlers (Clang -Wunused-variable with -Werror)
- Restore over-renamed values: rippledb, rippled.cfg, historical GitHub URL
- Concatenate nested namespaces in SpanNames.h and PathFindSpanNames.h
  (modernize-concat-nested-namespaces)
- Add missing includes and const qualifiers in test files
- Suppress intentional use-after-move in SpanGuardFactory move test
- Remove unused NetworkOPs.h include from PathRequest.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:02:55 +01:00
Pratik Mankawde
ac11217195 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	include/xrpl/telemetry/TraceContextPropagator.h
2026-04-29 14:24:38 +01:00
Pratik Mankawde
103dd605d2 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
# Conflicts:
#	include/xrpl/telemetry/SpanGuard.h
#	src/xrpld/overlay/detail/PeerImp.cpp
2026-04-29 14:23:31 +01:00
Pratik Mankawde
12b7316f71 feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:21:32 +01:00
Pratik Mankawde
09c5f5c3bf Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 12:31:03 +01:00
Pratik Mankawde
b8d3c52017 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 12:30:59 +01:00
Pratik Mankawde
21dad9a17d docs(telemetry): sync runbook, dashboards, and configs with code
- Add 14 missing spans to runbook (6 TxQ + 8 consensus)
- Fix tx.receive attributes and config table in runbook
- Document dispute.resolve and tx.included span events
- Add spanmetrics dimensions for close_time_correct and tx.suppressed
- Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL
- Wire $consensus_mode template variable to all consensus panels
- Add 10 Tempo search filters for operational attributes
- Apply rename script artifacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:29:53 +01:00
Pratik Mankawde
e07391fe78 chore: apply rename script (rippled → xrpld)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:25:03 +01:00
Pratik Mankawde
c5a59645d9 fix(telemetry): resolve merge conflicts, bashate, and rename for phase 5
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:28:54 +01:00
Pratik Mankawde
c0a5f57cdf Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 11:24:05 +01:00
Pratik Mankawde
fe058d49b4 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 11:21:35 +01:00
Pratik Mankawde
e8c826c816 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-29 11:17:19 +01:00
Pratik Mankawde
d753191d20 Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 11:16:51 +01:00
Pratik Mankawde
d4e91b462e fix(telemetry): resolve clang-tidy warnings in Telemetry interfaces
Use C++17 concatenated namespaces, add [[nodiscard]] to query methods,
add missing direct includes, and use pass-by-value + std::move in
NullTelemetry constructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:16:21 +01:00
Pratik Mankawde
f6105ece98 feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.

- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
34ee231d62 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
4f4b4dd199 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:30:31 +01:00
Pratik Mankawde
d87839230a fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
793fe65a96 fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
ded848075d feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:29:53 +01:00
Pratik Mankawde
19eead6955 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
eb51457e69 fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
75bcd4ff53 refactor(telemetry): extract span name constants into modular headers
Centralise scattered string literals into compile-time constants using
StaticStr<N> and join() for dot-separated composition. Shared primitives
live in SpanNames.h; RPC-specific names in RpcSpanNames.h. Future modules
(consensus, peer, ledger) add their own *SpanNames.h without bloating
the central header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:27:31 +01:00