Commit Graph

63 Commits

Author SHA1 Message Date
Pratik Mankawde
a44d91ec27 leftover clang-tidy fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 17:52:45 +01:00
Pratik Mankawde
2f96c6547c Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:51:31 +01:00
Pratik Mankawde
c187a62353 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:47:15 +01:00
Pratik Mankawde
c848e51e13 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:44:07 +01:00
Pratik Mankawde
8f9057729c Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:14:21 +01:00
Pratik Mankawde
f031befc6e compilation fixes and levelization fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:04:19 +01:00
Pratik Mankawde
4e0b6f5b9e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-28 18:32:44 +01:00
Pratik Mankawde
53e8c4d54e fix(docs): apply rename scripts to secure-OTel doc references
Two stray "rippled" tokens introduced by 43258e8d ("docs(telemetry):
add secure-OTel pipeline analysis…") were caught by check-rename in
CI. Re-run docs.sh to convert them to xrpld so the rename check
passes on PR #6425 (and downstream PR #6426 once merged up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 18:27:58 +01:00
Pratik Mankawde
5700eeed1b renaming and namespace updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 17:52:35 +01:00
Pratik Mankawde
7ac5343119 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:09:41 +01:00
Pratik Mankawde
954223958f renames
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:07:34 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
43258e8dc0 docs(telemetry): add secure-OTel pipeline analysis and link into plan
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.

Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Pratik Mankawde
92bc0b24b8 docs(telemetry): drop volatile line numbers from Phase 4 span-catalog table
Phase 4 added a span catalog in `06-implementation-phases.md` listing the
source location for each consensus span. Line numbers `Consensus.h:707`,
`RCLConsensus.cpp:232/341/492/541/900` drift on every refactor and would
become stale PR after PR. Filename alone is enough for operators to
grep — the RCLConsensus.cpp spans are already unambiguous from the span
name itself.
2026-05-14 16:59:43 +01:00
Pratik Mankawde
41d72cb51b merge: phase-3 (phase-1a docs fixes) into phase-4
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
2026-05-14 16:24:27 +01:00
Pratik Mankawde
45e1c15d24 merge: pratik/otel-phase2-rpc-tracing (phase-1a docs fixes) into pratik/otel-phase3-tx-tracing
# Conflicts:
#	OpenTelemetryPlan/05-configuration-reference.md
2026-05-14 16:13:35 +01:00
Pratik Mankawde
865ab65a07 merge: pratik/otel-phase1c-rpc-integration (phase-1a docs fixes) into pratik/otel-phase2-rpc-tracing 2026-05-14 16:11:04 +01:00
Pratik Mankawde
009c63e7db merge: pratik/otel-phase1b-telemetry-infra (phase-1a docs fixes) into pratik/otel-phase1c-rpc-integration 2026-05-14 16:11:01 +01:00
Pratik Mankawde
5d70a5fffd merge: pratik/otel-phase1a-plan-docs (phase-1a docs fixes) into pratik/otel-phase1b-telemetry-infra 2026-05-14 16:10:59 +01:00
Pratik Mankawde
f3a095ab65 docs(telemetry): align Phase 1a plan docs with Phase 1b implementation
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.

Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
  OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
  diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
  "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
  Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
  branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
  and the :4317 default from the sample config block and the options-summary
  table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
  into a new "Planned (not yet implemented)" table citing the phase that will
  add each one. Keep the example config minimal so copy-paste does not produce
  a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
  [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
  Complete" to "Phase 4a Plan" and describe the work as future. Replace the
  broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
  with a sentence pointing readers to where that spec will land. Renumber
  the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
  ("Effort Summary") was intentionally removed in earlier edits.
2026-05-14 16:09:48 +01:00
Pratik Mankawde
745102360b docs(telemetry): update Phase4 task list for simplified consensus attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:36:22 +01:00
Pratik Mankawde
19d9c44cf5 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:31:35 +01:00
Pratik Mankawde
5c14b57462 docs(telemetry): update Phase3 task list for simplified tx/txq attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:31:22 +01:00
Pratik Mankawde
c875944e05 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:29:32 +01:00
Pratik Mankawde
2430032e3a docs(telemetry): update Phase2 task list + design docs for attr rename
- Phase2_taskList: update attr refs to bare names, note node-health
  attrs moved to resource level.
- 02-design-decisions: strip xrpl.pathfind.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:29:20 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00
Pratik Mankawde
ef10c754b1 fix(telemetry): address code review findings for Phase 4 consensus tracing
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
2773de7b54 docs(telemetry): mark Phase 4/4a consensus tracing tasks complete
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
264516c37d docs update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
021c81e978 docs(telemetry): document hashSpan factory, ConsensusSpanNames.h, and API details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ab6b6d215e feat(telemetry): add avalanche threshold and close time consensus attributes
Record the close time voting threshold and consensus state on
consensus.update_positions and consensus.check spans:

- xrpl.consensus.close_time_threshold: the avCT_CONSENSUS_PCT (75%)
  threshold required for close time agreement
- xrpl.consensus.have_close_time_consensus: whether validators
  reached close time consensus in this iteration

These attributes enable dashboards to show how the close time
voting process converges (or stalls) across consensus iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
5cbb349efa fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3a1e462bef docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3c0eec0209 docs(telemetry): add Task 3.10 TxQ instrumentation to Phase 3 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
312dec2baa docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
e63a5f72be docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
1e2287e6e1 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
bd6e58a20e fix(telemetry): add missing span constants, fix test API, update levelization
Add unknownCommand and wsUpgrade span name constants to RpcSpanNames.h,
fix SpanGuardFactory tests to use the 3-argument SpanGuard::span() API,
update levelization results, and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:21:13 +01:00
Pratik Mankawde
fa2736277f Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-28 15:07:17 +01:00
Pratik Mankawde
96470e0c8d fix(telemetry): fix include ordering and markdown table formatting
Move Telemetry.h (associated header) to first include position in
Telemetry.cpp per the project's include-order convention. Trim
trailing whitespace from POC_taskList.md markdown table columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:04:09 +01:00
Pratik Mankawde
ed8164d502 docs(telemetry): add Task 2.9 PathFind instrumentation to Phase 2 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
eb51457e69 fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
9bc8cc6b4e docs(telemetry): update Phase2 task list to reflect actual implementation
Mark deferred tasks (2.1→Phase 3, 2.5→low priority) with rationale.
Mark superseded tasks (2.2→Phase 1c SpanGuard factory). Add Task 2.7
for Grafana search filters. Update summary table with status column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
a9ee819ea1 docs(telemetry): add Phase 2-5 task lists and appendix update
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
5e8277f36a docs(telemetry): fix doc references to match pimpl architecture
Replace references to non-existent TracingInstrumentation.h with
SpanGuard.cpp pimpl implementation that actually exists on this branch.
Update conditional compilation section to describe the pimpl approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00