Commit Graph

14100 Commits

Author SHA1 Message Date
Pratik Mankawde
309d3e6449 fixed interround consensus linking and added some state attrs to spans.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 20:15:41 +01:00
Pratik Mankawde
4e0b6f5b9e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-28 18:32:44 +01:00
Pratik Mankawde
a35003b123 fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 18:28:15 +01:00
Pratik Mankawde
53e8c4d54e fix(docs): apply rename scripts to secure-OTel doc references
Two stray "rippled" tokens introduced by 43258e8d ("docs(telemetry):
add secure-OTel pipeline analysis…") were caught by check-rename in
CI. Re-run docs.sh to convert them to xrpld so the rename check
passes on PR #6425 (and downstream PR #6426 once merged up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 18:27:58 +01:00
Pratik Mankawde
c070177800 removed direct opentelemetry code from consensus code.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 18:02:51 +01:00
Pratik Mankawde
5700eeed1b renaming and namespace updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 17:52:35 +01:00
Pratik Mankawde
23387f6a06 merge: phase-3 (consensus tracing removal) into phase-4
Brings in pratik/otel-phase3-tx-tracing's two commits that move all
consensus-tracing content off phase-3:

  - c9521b97fe refactor(telemetry): pull consensus-tracing scope-leak
    out of phase-3
  - d6b101069e refactor(telemetry): remove consensus tracing from
    phase-3

Phase-4 already owns this content (with the consensus_span -> cons_span
namespace rename, round/accept/proposalSend/validationSend span
construction, and the relocated ConsensusSpanNames.h under
src/xrpld/consensus/), plus its own evolution on top — so the merge is
resolved as a tree-identity merge:

  - src/xrpld/app/consensus/ConsensusSpanNames.h: keep phase-4's
    renamed/expanded version (modify/delete conflict, resolved with
    --ours).
  - src/xrpld/app/consensus/RCLConsensus.cpp: keep phase-4's version
    with cons_span attribute calls, the trace_context inject blocks
    on broadcastPropose/validate, the secure-OTel TODO, and the
    full validation/round span instrumentation (content conflict,
    resolved with --ours).
  - src/xrpld/overlay/detail/PeerImp.cpp: keep phase-4's version with
    the proposalReceiveSpan/validationReceiveSpan calls, lambda span
    captures, and cons_span::attr::* setAttribute calls (content
    conflict, resolved with --ours).
  - src/xrpld/telemetry/ConsensusReceiveTracing.h: phase-3 deleted it,
    phase-4 still uses it. Restored from phase-4 HEAD (silent
    auto-deletion otherwise).
  - include/xrpl/telemetry/TraceContextPropagator.h: phase-3 stripped
    consensus references and the secure-OTel TODO; phase-4 still has
    both. Restored from phase-4 HEAD.
  - src/xrpld/telemetry/PropagationHelpers.h: phase-3 swapped the
    @see ConsensusReceiveTracing.h cross-reference for TxTracing.h;
    phase-4 still wants the consensus reference. Restored from
    phase-4 HEAD.

Net tree change on phase-4: zero. Verified via 'git diff <pre-merge-sha>
HEAD' returning empty.

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 17:04:41 +01:00
Pratik Mankawde
d6b101069e refactor(telemetry): remove consensus tracing from phase-3
Phase-3 (PR #6425) is scoped to transaction tracing only; consensus
tracing belongs to phase-4 (PR #6426). The previous commit on this
branch removed the namespace/attribute scaffolding c6c019ed8b leaked
into phase-3, but phase-3 still carried the consensus span construction
and trace-context propagation introduced in earlier commits
(61cb1faf8f, 93bed03d8d). Move that out too so phase-3 creates and
propagates no consensus spans of any kind.

Removed:
  - src/xrpld/telemetry/ConsensusReceiveTracing.h (deleted; phase-4
    owns it).
  - PeerImp.cpp: remove the std::make_shared<SpanGuard>(
    proposalReceiveSpan(...))/validationReceiveSpan(...) constructions
    in onMessage(TMProposeSet)/onMessage(TMValidation), drop the
    sp = std::move(span) lambda captures, and drop the
    #include <xrpld/telemetry/ConsensusReceiveTracing.h>.
  - RCLConsensus.cpp: drop the two telemetry::injectToProtobuf() blocks
    that injected the active trace context into TMProposeSet (in
    Adaptor::propose, after addSuppression) and TMValidation (in
    Adaptor::validate, around the broadcast call). Drop the now-unused
    #include of TraceContextPropagator.h and the
    XRPL_ENABLE_TELEMETRY-gated include of
    opentelemetry/context/runtime_context.h.
  - TraceContextPropagator.h: update file-level @see comment to drop
    the ConsensusReceiveTracing.h reference and to scope the
    "wired into the P2P message flow via PropagationHelpers.h"
    sentence to TMTransaction only.
  - PropagationHelpers.h: replace the
    @see ConsensusReceiveTracing.h cross-reference with
    @see TxTracing.h.

Inert consensus metadata (TraceCategory::Consensus enum value,
seg::consensus constant, isCategoryEnabled/categoryToSpanKind switch
arms, the SpanGuard.h doc-comment example) is intentionally preserved
on phase-3: nothing references it after this commit, but phase-4
needs it and removing it would widen the phase-3 -> phase-4 merge
surface for no benefit.

Verified via git grep: no remaining phase-3 references to
proposalReceiveSpan, validationReceiveSpan, ConsensusReceiveTracing,
consensus_span::, consensus.proposal, or consensus.validation.

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 17:01:46 +01:00
Pratik Mankawde
c9521b97fe refactor(telemetry): pull consensus-tracing scope-leak out of phase-3
Commit c6c019ed8b ("addressed code review comments") bundled tx-tracing
review fixes with consensus-tracing scaffolding that belongs on
pratik/otel-phase4-consensus-tracing (PR #6426). This commit lifts the
consensus-only parts back out of phase-3 so PR #6425 stays scoped to
transaction tracing. Phase-4 already carries the same content (via prior
phase-3 -> phase-4 merges) plus its own evolution on top, so nothing is
moved across — only removed here.

Removed:
  - src/xrpld/app/consensus/ConsensusSpanNames.h (deleted; phase-4 owns
    it).
  - PeerImp.cpp: revert onMessage(TMProposeSet)/onMessage(TMValidation)
    consensus-attr setAttribute calls to the hardcoded
    "xrpl.consensus.{trusted,round,ledger.seq}" strings used before
    c6c019ed8b. Drop the now-unused
    #include <xrpld/app/consensus/ConsensusSpanNames.h>.
  - RCLConsensus::Adaptor::validate(): remove the
    TODO(observability/secure-OTel) block on validation trace_context.
  - TraceContextPropagator.h: remove the TODO(observability/secure-OTel)
    block on injectToProtobuf().

Tx-tracing parts of c6c019ed8b are intentionally untouched.

No phase-3 caller of telemetry::consensus_span:: remains; verified via
git grep. No test on phase-3 references the removed header.

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:51:23 +01:00
Pratik Mankawde
248d85cae6 namespace renaming
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:23:57 +01:00
Pratik Mankawde
7ac5343119 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:09:41 +01:00
Pratik Mankawde
954223958f renames
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:07:34 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
43258e8dc0 docs(telemetry): add secure-OTel pipeline analysis and link into plan
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.

Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
4c4c6f5de2 build fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:35:41 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Pratik Mankawde
64ffcffe32 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-27 18:27:11 +01:00
Pratik Mankawde
bfdcd3da87 fix(telemetry): use resolved command/method name as span suffix
Per PR #6438 review thread r3250432621: known-command errors
(rpcTOO_BUSY, rpcNO_PERMISSION, etc.) were collapsing into a
single rpc.command.unknown span name, hiding per-command error
rates in dashboards. Same anti-pattern existed for gRPC, where
every method was bucketed under grpc.request with the method
relegated to an attribute.

- RPCHandler.cpp: doCommand error path uses cmdName as the span
  suffix; the rpc_span::val::unknownCommand fallback only applies
  when the request truly omits both command and method fields.
- GRPCServer.cpp: gRPC span name is now grpc.<MethodName>
  (e.g. grpc.GetLedger). Method also retained as an attribute.
- GrpcSpanNames.h: drop the unused op::request constant; update
  the span-hierarchy comment.
- RpcSpanNames.h: update the gRPC span diagram to match.

Dashboards on downstream phases will benefit from per-command
breakdowns without needing TraceQL attribute filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:26:29 +01:00
Pratik Mankawde
f6f0cb1a5f updated class comment
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:55:09 +01:00
Pratik Mankawde
6aa8570d6c addressed code review comments.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:36:06 +01:00
Pratik Mankawde
824f63216a Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:57:08 +01:00
Pratik Mankawde
a104140a51 addressing code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:46:35 +01:00
Pratik Mankawde
92bc0b24b8 docs(telemetry): drop volatile line numbers from Phase 4 span-catalog table
Phase 4 added a span catalog in `06-implementation-phases.md` listing the
source location for each consensus span. Line numbers `Consensus.h:707`,
`RCLConsensus.cpp:232/341/492/541/900` drift on every refactor and would
become stale PR after PR. Filename alone is enough for operators to
grep — the RCLConsensus.cpp spans are already unambiguous from the span
name itself.
2026-05-14 16:59:43 +01:00
Pratik Mankawde
41d72cb51b merge: phase-3 (phase-1a docs fixes) into phase-4
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
2026-05-14 16:24:27 +01:00
Pratik Mankawde
45e1c15d24 merge: pratik/otel-phase2-rpc-tracing (phase-1a docs fixes) into pratik/otel-phase3-tx-tracing
# Conflicts:
#	OpenTelemetryPlan/05-configuration-reference.md
2026-05-14 16:13:35 +01:00
Pratik Mankawde
865ab65a07 merge: pratik/otel-phase1c-rpc-integration (phase-1a docs fixes) into pratik/otel-phase2-rpc-tracing 2026-05-14 16:11:04 +01:00
Pratik Mankawde
009c63e7db merge: pratik/otel-phase1b-telemetry-infra (phase-1a docs fixes) into pratik/otel-phase1c-rpc-integration 2026-05-14 16:11:01 +01:00
Pratik Mankawde
5d70a5fffd merge: pratik/otel-phase1a-plan-docs (phase-1a docs fixes) into pratik/otel-phase1b-telemetry-infra 2026-05-14 16:10:59 +01:00
Pratik Mankawde
f3a095ab65 docs(telemetry): align Phase 1a plan docs with Phase 1b implementation
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.

Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
  OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
  diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
  "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
  Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
  branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
  and the :4317 default from the sample config block and the options-summary
  table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
  into a new "Planned (not yet implemented)" table citing the phase that will
  add each one. Keep the example config minimal so copy-paste does not produce
  a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
  [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
  Complete" to "Phase 4a Plan" and describe the work as future. Replace the
  broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
  with a sentence pointing readers to where that spec will land. Renumber
  the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
  ("Effort Summary") was intentionally removed in earlier edits.
2026-05-14 16:09:48 +01:00
Pratik Mankawde
0b4b3c7bf2 merge: pratik/otel-phase3-tx-tracing fix(SpanKind) into pratik/otel-phase4-consensus-tracing 2026-05-14 15:54:59 +01:00
Pratik Mankawde
3e894f8e93 merge: pratik/otel-phase2-rpc-tracing fix(SpanKind) into pratik/otel-phase3-tx-tracing 2026-05-14 15:54:57 +01:00
Pratik Mankawde
cb7dc5c52e merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing 2026-05-14 15:54:55 +01:00
Pratik Mankawde
9cfb43d8d0 merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration 2026-05-14 15:54:53 +01:00
Pratik Mankawde
7ada57e2a8 fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span()
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.

Add categoryToSpanKind() mapping:
  - Rpc           -> kServer   (inbound synchronous request)
  - Peer          -> kConsumer (inbound async peer message)
  - Transactions  -> kInternal
  - Consensus     -> kInternal
  - Ledger        -> kInternal

Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
2026-05-14 15:53:59 +01:00
Pratik Mankawde
61ab5c6fe3 fix(telemetry): align Tempo consensus search tags with C++ attribute names
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:59:08 +01:00
Pratik Mankawde
837f7e7b50 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-14 13:58:38 +01:00
Pratik Mankawde
b392035544 fix(telemetry): align Tempo TX search tags with C++ attribute names
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:31 +01:00
Pratik Mankawde
450004ebd8 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-14 13:58:19 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
ea30adeb47 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:53:43 +01:00
Pratik Mankawde
9bc2e4abb3 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:53:32 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
745102360b docs(telemetry): update Phase4 task list for simplified consensus attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:36:22 +01:00
Pratik Mankawde
19d9c44cf5 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:31:35 +01:00
Pratik Mankawde
5c14b57462 docs(telemetry): update Phase3 task list for simplified tx/txq attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:31:22 +01:00
Pratik Mankawde
c875944e05 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:29:32 +01:00
Pratik Mankawde
2430032e3a docs(telemetry): update Phase2 task list + design docs for attr rename
- Phase2_taskList: update attr refs to bare names, note node-health
  attrs moved to resource level.
- 02-design-decisions: strip xrpl.pathfind.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:29:20 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00