Commit Graph

14064 Commits

Author SHA1 Message Date
Pratik Mankawde
c9521b97fe refactor(telemetry): pull consensus-tracing scope-leak out of phase-3
Commit c6c019ed8b ("addressed code review comments") bundled tx-tracing
review fixes with consensus-tracing scaffolding that belongs on
pratik/otel-phase4-consensus-tracing (PR #6426). This commit lifts the
consensus-only parts back out of phase-3 so PR #6425 stays scoped to
transaction tracing. Phase-4 already carries the same content (via prior
phase-3 -> phase-4 merges) plus its own evolution on top, so nothing is
moved across — only removed here.

Removed:
  - src/xrpld/app/consensus/ConsensusSpanNames.h (deleted; phase-4 owns
    it).
  - PeerImp.cpp: revert onMessage(TMProposeSet)/onMessage(TMValidation)
    consensus-attr setAttribute calls to the hardcoded
    "xrpl.consensus.{trusted,round,ledger.seq}" strings used before
    c6c019ed8b. Drop the now-unused
    #include <xrpld/app/consensus/ConsensusSpanNames.h>.
  - RCLConsensus::Adaptor::validate(): remove the
    TODO(observability/secure-OTel) block on validation trace_context.
  - TraceContextPropagator.h: remove the TODO(observability/secure-OTel)
    block on injectToProtobuf().

Tx-tracing parts of c6c019ed8b are intentionally untouched.

No phase-3 caller of telemetry::consensus_span:: remains; verified via
git grep. No test on phase-3 references the removed header.

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:51:23 +01:00
Pratik Mankawde
954223958f renames
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:07:34 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
43258e8dc0 docs(telemetry): add secure-OTel pipeline analysis and link into plan
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.

Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
4c4c6f5de2 build fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:35:41 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Pratik Mankawde
64ffcffe32 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-27 18:27:11 +01:00
Pratik Mankawde
bfdcd3da87 fix(telemetry): use resolved command/method name as span suffix
Per PR #6438 review thread r3250432621: known-command errors
(rpcTOO_BUSY, rpcNO_PERMISSION, etc.) were collapsing into a
single rpc.command.unknown span name, hiding per-command error
rates in dashboards. Same anti-pattern existed for gRPC, where
every method was bucketed under grpc.request with the method
relegated to an attribute.

- RPCHandler.cpp: doCommand error path uses cmdName as the span
  suffix; the rpc_span::val::unknownCommand fallback only applies
  when the request truly omits both command and method fields.
- GRPCServer.cpp: gRPC span name is now grpc.<MethodName>
  (e.g. grpc.GetLedger). Method also retained as an attribute.
- GrpcSpanNames.h: drop the unused op::request constant; update
  the span-hierarchy comment.
- RpcSpanNames.h: update the gRPC span diagram to match.

Dashboards on downstream phases will benefit from per-command
breakdowns without needing TraceQL attribute filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:26:29 +01:00
Pratik Mankawde
f6f0cb1a5f updated class comment
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:55:09 +01:00
Pratik Mankawde
6aa8570d6c addressed code review comments.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:36:06 +01:00
Pratik Mankawde
824f63216a Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:57:08 +01:00
Pratik Mankawde
a104140a51 addressing code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:46:35 +01:00
Pratik Mankawde
45e1c15d24 merge: pratik/otel-phase2-rpc-tracing (phase-1a docs fixes) into pratik/otel-phase3-tx-tracing
# Conflicts:
#	OpenTelemetryPlan/05-configuration-reference.md
2026-05-14 16:13:35 +01:00
Pratik Mankawde
865ab65a07 merge: pratik/otel-phase1c-rpc-integration (phase-1a docs fixes) into pratik/otel-phase2-rpc-tracing 2026-05-14 16:11:04 +01:00
Pratik Mankawde
009c63e7db merge: pratik/otel-phase1b-telemetry-infra (phase-1a docs fixes) into pratik/otel-phase1c-rpc-integration 2026-05-14 16:11:01 +01:00
Pratik Mankawde
5d70a5fffd merge: pratik/otel-phase1a-plan-docs (phase-1a docs fixes) into pratik/otel-phase1b-telemetry-infra 2026-05-14 16:10:59 +01:00
Pratik Mankawde
f3a095ab65 docs(telemetry): align Phase 1a plan docs with Phase 1b implementation
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.

Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
  OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
  diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
  "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
  Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
  branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
  and the :4317 default from the sample config block and the options-summary
  table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
  into a new "Planned (not yet implemented)" table citing the phase that will
  add each one. Keep the example config minimal so copy-paste does not produce
  a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
  [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
  Complete" to "Phase 4a Plan" and describe the work as future. Replace the
  broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
  with a sentence pointing readers to where that spec will land. Renumber
  the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
  ("Effort Summary") was intentionally removed in earlier edits.
2026-05-14 16:09:48 +01:00
Pratik Mankawde
3e894f8e93 merge: pratik/otel-phase2-rpc-tracing fix(SpanKind) into pratik/otel-phase3-tx-tracing 2026-05-14 15:54:57 +01:00
Pratik Mankawde
cb7dc5c52e merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing 2026-05-14 15:54:55 +01:00
Pratik Mankawde
9cfb43d8d0 merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration 2026-05-14 15:54:53 +01:00
Pratik Mankawde
7ada57e2a8 fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span()
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.

Add categoryToSpanKind() mapping:
  - Rpc           -> kServer   (inbound synchronous request)
  - Peer          -> kConsumer (inbound async peer message)
  - Transactions  -> kInternal
  - Consensus     -> kInternal
  - Ledger        -> kInternal

Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
2026-05-14 15:53:59 +01:00
Pratik Mankawde
b392035544 fix(telemetry): align Tempo TX search tags with C++ attribute names
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:31 +01:00
Pratik Mankawde
450004ebd8 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-14 13:58:19 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
9bc2e4abb3 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:53:32 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
5c14b57462 docs(telemetry): update Phase3 task list for simplified tx/txq attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:31:22 +01:00
Pratik Mankawde
c875944e05 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:29:32 +01:00
Pratik Mankawde
2430032e3a docs(telemetry): update Phase2 task list + design docs for attr rename
- Phase2_taskList: update attr refs to bare names, note node-health
  attrs moved to resource level.
- 02-design-decisions: strip xrpl.pathfind.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:29:20 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00
Pratik Mankawde
e339ba1f6b refactor(telemetry): simplify tx/txq attr naming on phase-3 — drop xrpl.<domain>. prefix
- Add canonical shared attrs to SpanNames.h: txHash (xrpl.tx.hash),
  peerId (xrpl.peer.id), ledgerSeq (xrpl.ledger.seq).
- Drop xrpl.tx.* prefix: local, path, suppressed, peer_version.
- Domain-qualify: status -> tx_status, txq status -> txq_status.
- TxQ: tx_hash -> reuse canonical txHash, ledger_seq -> reuse canonical
  ledgerSeq; bare names for fee_level_paid, required_fee_level, etc.
- Update call sites in PeerImp.cpp, NetworkOPs.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:01:00 +01:00
Pratik Mankawde
ac1b01b4c7 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 15:57:45 +01:00
Pratik Mankawde
497dd007d9 refactor(telemetry): simplify attr naming on phase-2 — drop xrpl.pathfind. prefix
- Drop xrpl.pathfind.* prefix from per-span attrs (source_account,
  dest_account, fast, search_level, num_complete_paths, num_paths,
  num_requests).
- Keep xrpl.pathfind.ledger_index qualified (rule 5: distinct from
  xrpl.ledger.seq).
- Remove per-span nodeAmendmentBlocked/nodeServerState calls from
  RPCHandler — promoted to resource-level attrs.
- Mark node-health attrs in SpanNames.h as RESOURCE-ONLY with doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:57:36 +01:00
Pratik Mankawde
0d845149ec Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 15:55:39 +01:00
Pratik Mankawde
7a854ccad2 refactor(telemetry): simplify attr naming on phase-1c — drop xrpl.<domain>. prefix
- Drop xrpl.rpc.* prefix from per-span attrs (command, version).
- Qualify collision-prone fields: role -> rpc_role/grpc_role,
  status -> rpc_status/grpc_status.
- Rename payload_size -> request_payload_size for cross-domain clarity.
- Simplify link.type -> link_type (bare name, no join).
- Update convention doc in SpanNames.h to reflect new naming rules.
- Update telemetry.md doc with renamed attr keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:13 +01:00
Pratik Mankawde
937d11d7c3 fix(telemetry): default tx span attrs on receive path
Set defaults for tx_span::attr::suppressed (false) and
tx_span::attr::status ("new") immediately after creating the txReceive
span. Without defaults, spans whose suppressed/status attributes would
only be set in the HashRouter-suppressed branch lacked these attributes
entirely, producing incomplete span data in downstream stores.

The suppressed branch still overrides these when the transaction has
already been seen via HashRouter.
2026-05-13 14:40:57 +01:00
Pratik Mankawde
cf18032e7f fix(telemetry): address clang-tidy errors on phase3 transaction tracing files
- Add [[maybe_unused]] to RAII span variables in TxQ.cpp
- Remove unused st.h include, add missing to_string header in TxQ.cpp
- Concatenate nested namespaces in TxQSpanNames.h, TxSpanNames.h,
  ConsensusReceiveTracing.h, PropagationHelpers.h, TxTracing.h
- Remove unused TraceContextPropagator.h include from RCLConsensus.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:57:42 +01:00
Pratik Mankawde
bb3732c22f Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 18:19:58 +01:00
Pratik Mankawde
87a8780b73 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
# Conflicts:
#	src/xrpld/rpc/detail/RPCHandler.cpp
2026-04-29 18:18:37 +01:00
Pratik Mankawde
78522ba18e Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 18:17:02 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
e21e7b0d51 fix(telemetry): add missing PublicKey.h include for toBase58 in Application.cpp
Clang-tidy misc-include-cleaner requires direct includes for all used
symbols. Application.cpp calls toBase58(TokenType::NodePublic, ...) at
line 1359 but did not directly include PublicKey.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:51:35 +01:00
Pratik Mankawde
61cb1faf8f feat(telemetry): add cross-node trace context propagation
Wire trace context into P2P message flow so distributed traces
link across nodes. TX relay injects SpanGuard context via
PropagationHelpers.h; consensus propose/validate injects via
TraceContextPropagator.h. Receive-side extraction in PeerImp
creates child spans for proposals and validations.

- Add TraceBytes struct and SpanGuard::getTraceBytes() for
  extracting raw trace context without OTel type dependencies
- Add PropagationHelpers.h: injectSpanContext(SpanGuard, proto)
- Add ConsensusReceiveTracing.h: proposalReceiveSpan(),
  validationReceiveSpan() with parent context extraction
- NetworkOPs::apply(): inject tx.process context before relay
- RCLConsensus::propose()/validate(): inject active span context
- PeerImp: create receive spans for proposals and validations
  with sender's trace context as parent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
5cbb349efa fix(telemetry): fix include ordering, levelization, and rename for phase 3
Move TxQSpanNames.h include to correct alphabetical position, update
levelization results for new xrpld.telemetry module dependencies,
and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
93bed03d8d fix: extend tx span lifetimes across async job boundaries
- tx.receive span in PeerImp: convert to shared_ptr, capture in
  checkTransaction lambda so it measures actual processing, not just
  message parsing
- tx.process span in NetworkOPs: convert to shared_ptr, store in
  TransactionStatus so it lives until the batch job processes the entry;
  sync path unchanged (span destructs on function return)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
581ab8f552 refactor(telemetry): replace txSpan with generic hashSpan factory
Replace SpanGuard::txSpan(prefix, name, hash) with the generic
SpanGuard::hashSpan(TraceCategory, name, hash) that accepts a
TraceCategory parameter instead of hardcoding Transactions. This
enables reuse for consensus round spans (Phase 4) and any future
subsystem needing deterministic cross-node trace correlation via
hash-derived trace IDs.

Both overloads are replaced:
- hashSpan(cat, name, hash, size) — standalone with random span_id
- hashSpan(cat, name, hash, size, parentSpanId, parentSize, flags)
  — with remote parent from protobuf context propagation

Add full span name constants (tx_span::receive, tx_span::process)
to TxSpanNames.h following the ConsensusSpanNames.h pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
6154357daa fix(telemetry): add const qualifiers to TraceContextPropagator locals
Mark local variables in extractFromProtobuf() and injectToProtobuf()
as const since they are not modified after initialization: traceId,
spanId, flags, spanCtx, and span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
3a1e462bef docs(telemetry): fix Phase 3 task list stale references and missing deliverables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00