Commit Graph

362 Commits

Author SHA1 Message Date
Pratik Mankawde
b4e4b57504 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:45:14 +01:00
Pratik Mankawde
cbf389943f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:44:49 +01:00
Pratik Mankawde
57175ab12c Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:37:37 +01:00
Pratik Mankawde
19d9c44cf5 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:31:35 +01:00
Pratik Mankawde
c875944e05 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:29:32 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00
Pratik Mankawde
495d5bd8a0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:17:12 +01:00
Pratik Mankawde
5cd71ed107 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:16:50 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
e60efd4d2f Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:10:46 +01:00
Pratik Mankawde
46d1012ad4 refactor(telemetry): simplify consensus attr naming on phase-4 — drop xrpl.consensus. prefix
- Add canonical shared bare attrs to SpanNames.h: closeTime,
  closeTimeCorrect, closeResolutionMs (reused by ledger domain).
- Keep qualified (rule 5): ledgerId, mode, round, roundId.
- Domain-qualify collisions: state -> consensus_state,
  result -> consensus_result.
- Reuse canonical ledgerSeq from phase-3.
- Drop xrpl.consensus.* prefix from 20+ attrs (proposers, round_time_ms,
  converge_percent, avalanche_threshold, etc.).
- Dispute attrs: bare names (dispute_our_vote, dispute_yays, etc.).
- Update call sites in RCLConsensus.cpp, Consensus.h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:04:16 +01:00
Pratik Mankawde
7eeddd3ad9 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:01:13 +01:00
Pratik Mankawde
e339ba1f6b refactor(telemetry): simplify tx/txq attr naming on phase-3 — drop xrpl.<domain>. prefix
- Add canonical shared attrs to SpanNames.h: txHash (xrpl.tx.hash),
  peerId (xrpl.peer.id), ledgerSeq (xrpl.ledger.seq).
- Drop xrpl.tx.* prefix: local, path, suppressed, peer_version.
- Domain-qualify: status -> tx_status, txq status -> txq_status.
- TxQ: tx_hash -> reuse canonical txHash, ledger_seq -> reuse canonical
  ledgerSeq; bare names for fee_level_paid, required_fee_level, etc.
- Update call sites in PeerImp.cpp, NetworkOPs.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:01:00 +01:00
Pratik Mankawde
ac1b01b4c7 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 15:57:45 +01:00
Pratik Mankawde
497dd007d9 refactor(telemetry): simplify attr naming on phase-2 — drop xrpl.pathfind. prefix
- Drop xrpl.pathfind.* prefix from per-span attrs (source_account,
  dest_account, fast, search_level, num_complete_paths, num_paths,
  num_requests).
- Keep xrpl.pathfind.ledger_index qualified (rule 5: distinct from
  xrpl.ledger.seq).
- Remove per-span nodeAmendmentBlocked/nodeServerState calls from
  RPCHandler — promoted to resource-level attrs.
- Mark node-health attrs in SpanNames.h as RESOURCE-ONLY with doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:57:36 +01:00
Pratik Mankawde
0d845149ec Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 15:55:39 +01:00
Pratik Mankawde
7a854ccad2 refactor(telemetry): simplify attr naming on phase-1c — drop xrpl.<domain>. prefix
- Drop xrpl.rpc.* prefix from per-span attrs (command, version).
- Qualify collision-prone fields: role -> rpc_role/grpc_role,
  status -> rpc_status/grpc_status.
- Rename payload_size -> request_payload_size for cross-domain clarity.
- Simplify link.type -> link_type (bare name, no join).
- Update convention doc in SpanNames.h to reflect new naming rules.
- Update telemetry.md doc with renamed attr keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:13 +01:00
Pratik Mankawde
93a7df6147 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 11:31:16 +01:00
Pratik Mankawde
c096eeb239 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 11:30:22 +01:00
Pratik Mankawde
92072ecca4 fix(telemetry): fix CI failures — clang-tidy, levelization, linker
Clang-tidy fixes:
- Concatenate nested namespaces (modernize-concat-nested-namespaces)
  in OTelCollector.h, OTelCollector.cpp, ValidationTracker.h/.cpp
- Add missing direct includes (misc-include-cleaner) in
  ValidationTracker.cpp, test, CollectorManager.cpp, OTelCollector.cpp
- Make lock_guard variables const (misc-const-correctness)
- Add braces around single-line if/else (readability-braces-around-statements)
- Use designated initializer for WindowEvent (modernize-use-designated-initializers)
- Initialize LedgerEvent::seq field (cppcoreguidelines-pro-type-member-init)

Linker fix:
- Add ValidationTracker.cpp as source to xrpl.test.telemetry target
  (it lives in src/xrpld/ but the test links against libxrpl only)

Levelization fix:
- Remove stale dependency edges from ordering.txt that were introduced
  by the erroneous develop-merge commit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:07:25 +01:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
ed31bab500 fix: restore unrelated files to phase-6 state after revert
The blanket revert of f4555c80fe also un-reverted some files that had
been correctly matched to phase-6 (nodestore Backend API refactor,
Vault_test changes). Restore those to the base branch state so the
phase-7 PR only contains telemetry changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:15:14 +01:00
Pratik Mankawde
b62987bda7 Revert "fix: revert all unrelated upstream develop changes from phase-7 PR"
This reverts commit f4555c80fe.
2026-05-06 14:13:59 +01:00
Pratik Mankawde
40cc7f9ed7 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-04-30 17:00:39 +01:00
Pratik Mankawde
3f793b2098 fix: revert incorrect docs.sh renames (README.md URL, configLegacyName)
- protocol/README.md: restore historical GitHub URL path (src/ripple/)
- Config.cpp: restore configLegacyName as "rippled.cfg" (legacy name
  must remain as-is for backward compatibility)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:00:03 +01:00
Pratik Mankawde
f4555c80fe fix: revert all unrelated upstream develop changes from phase-7 PR
Reverts 259 files that carried unrelated upstream changes through the
phase-6 merge: enum class removals (cppcoreguidelines-use-enum-class),
scoped_lock→lock_guard conversions (modernize-use-scoped-lock),
nodestore Backend API changes (void const* key), .clang-tidy config,
test infrastructure deletions, and miscellaneous develop changes.

These changes belong on develop, not in the telemetry PR chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:59:24 +01:00
Pratik Mankawde
f2bc0b18f2 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-04-30 16:45:55 +01:00
Pratik Mankawde
f44b89b99d fix: revert unrelated develop changes from phase-7 PR
- Revert reusable-build-test-config.yml to develop (action SHA update
  and "Show test failure summary" step removal don't belong here)
- Revert upload-conan-deps.yml to develop (action SHA update)
- Revert features.macro: BatchInnerSigs and Batch back to Supported::no
  (these feature flag changes are unrelated to telemetry)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:43 +01:00
Pratik Mankawde
58a3be4517 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-04-29 21:18:03 +01:00
Pratik Mankawde
f183e9b57f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-04-29 21:17:48 +01:00
Pratik Mankawde
9e12e660fe Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:25:13 +01:00
Pratik Mankawde
b65f91117f fix: address CI checks (prettier, docs.sh rename, levelization)
- Prettier formatting for markdown docs and OTelCollector header
- docs.sh rippled→xrpld renames in OTelCollector.cpp comments/strings
- Updated levelization ordering with new dependency edges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:03:22 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
f18ddd95c1 fix(telemetry): address clang-tidy errors on phase4 consensus tracing files
- Add [[nodiscard]] to getConsensusTraceStrategy, getYays, getNays
- Add missing <string>, SpanGuard.h, SpanNames.h includes
- Fix widening cast placement (cast before arithmetic, not after)
- Replace nested ternary with lambda for const dir variable
- Add braces to if/else-if chains in Consensus.h
- Concatenate nested namespaces in ConsensusSpanNames.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:54:23 +01:00
Pratik Mankawde
769668579a Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	.codecov.yml
#	.github/scripts/levelization/results/ordering.txt
#	.github/workflows/reusable-clang-tidy-files.yml
#	CMakeLists.txt
#	OpenTelemetryPlan/00-tracing-fundamentals.md
#	OpenTelemetryPlan/01-architecture-analysis.md
#	OpenTelemetryPlan/02-design-decisions.md
#	OpenTelemetryPlan/03-implementation-strategy.md
#	OpenTelemetryPlan/04-code-samples.md
#	OpenTelemetryPlan/05-configuration-reference.md
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/07-observability-backends.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	OpenTelemetryPlan/POC_taskList.md
#	OpenTelemetryPlan/Phase2_taskList.md
#	OpenTelemetryPlan/Phase3_taskList.md
#	OpenTelemetryPlan/Phase4_taskList.md
#	OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
#	OpenTelemetryPlan/Phase5_taskList.md
#	OpenTelemetryPlan/presentation.md
#	cfg/xrpld-example.cfg
#	conan.lock
#	conanfile.py
#	cspell.config.yaml
#	docker/telemetry/TESTING.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docker/telemetry/integration-test.sh
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/tempo.yaml
#	docker/telemetry/xrpld-telemetry.cfg
#	docs/build/telemetry.md
#	docs/telemetry-runbook.md
#	include/xrpl/core/ServiceRegistry.h
#	include/xrpl/protocol/detail/features.macro
#	include/xrpl/telemetry/SpanGuard.h
#	include/xrpl/telemetry/Telemetry.h
#	include/xrpl/telemetry/TraceContextPropagator.h
#	src/libxrpl/basics/MallocTrim.cpp
#	src/libxrpl/nodestore/backend/MemoryFactory.cpp
#	src/libxrpl/nodestore/backend/NuDBFactory.cpp
#	src/libxrpl/nodestore/backend/RocksDBFactory.cpp
#	src/libxrpl/telemetry/NullTelemetry.cpp
#	src/libxrpl/telemetry/Telemetry.cpp
#	src/libxrpl/telemetry/TelemetryConfig.cpp
#	src/tests/libxrpl/basics/MallocTrim.cpp
#	src/tests/libxrpl/telemetry/TelemetryConfig.cpp
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/consensus/RCLConsensus.h
#	src/xrpld/app/ledger/detail/BuildLedger.cpp
#	src/xrpld/app/ledger/detail/LedgerMaster.cpp
#	src/xrpld/app/main/Application.cpp
#	src/xrpld/app/misc/NetworkOPs.cpp
#	src/xrpld/consensus/Consensus.h
#	src/xrpld/overlay/detail/PeerImp.cpp
#	src/xrpld/rpc/detail/RPCHandler.cpp
#	src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
581ab8f552 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
6154357daa fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ff27e62e1f fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
ecd02134fa feat(telemetry): add hash-derived trace IDs for transaction spans
Derive trace_id from txHash[0:16] so all nodes handling the same
transaction produce spans under the same trace. Protobuf span_id
propagation provides parent-child relay ordering when available.

- Add SpanGuard::txSpan() factory methods (hash-derived trace ID)
- Add TxTracing.h helpers: txReceiveSpan(), txProcessSpan()
- Update PeerImp and NetworkOPs to use the new helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3508917f17 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
f434706eec Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
2026-04-29 17:16:28 +01:00
Pratik Mankawde
54c97daaf1 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:53 +01:00
Pratik Mankawde
654fe2d30f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
a05ada89ec refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:57 +01:00
Pratik Mankawde
8afe604aff fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:36 +01:00
Pratik Mankawde
6a8053df2d fix(telemetry): use thread_local PRNG for span IDs and update class diagram
Replace per-call std::random_device with thread_local std::mt19937 in
txSpan() for span ID generation. random_device is ~423x slower due to
/dev/urandom syscalls on each construction; mt19937 is seeded once per
thread and reused for all subsequent span IDs.

Update the SpanGuard class ASCII diagram to include txSpan factory
methods that were added in the hash-derived trace ID commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:03:15 +01:00