Commit Graph

14028 Commits

Author SHA1 Message Date
Pratik Mankawde
64ffcffe32 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-27 18:27:11 +01:00
Pratik Mankawde
bfdcd3da87 fix(telemetry): use resolved command/method name as span suffix
Per PR #6438 review thread r3250432621: known-command errors
(rpcTOO_BUSY, rpcNO_PERMISSION, etc.) were collapsing into a
single rpc.command.unknown span name, hiding per-command error
rates in dashboards. Same anti-pattern existed for gRPC, where
every method was bucketed under grpc.request with the method
relegated to an attribute.

- RPCHandler.cpp: doCommand error path uses cmdName as the span
  suffix; the rpc_span::val::unknownCommand fallback only applies
  when the request truly omits both command and method fields.
- GRPCServer.cpp: gRPC span name is now grpc.<MethodName>
  (e.g. grpc.GetLedger). Method also retained as an attribute.
- GrpcSpanNames.h: drop the unused op::request constant; update
  the span-hierarchy comment.
- RpcSpanNames.h: update the gRPC span diagram to match.

Dashboards on downstream phases will benefit from per-command
breakdowns without needing TraceQL attribute filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:26:29 +01:00
Pratik Mankawde
f6f0cb1a5f updated class comment
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:55:09 +01:00
Pratik Mankawde
6aa8570d6c addressed code review comments.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 17:36:06 +01:00
Pratik Mankawde
824f63216a Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:57:08 +01:00
Pratik Mankawde
a104140a51 addressing code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 16:46:35 +01:00
Pratik Mankawde
865ab65a07 merge: pratik/otel-phase1c-rpc-integration (phase-1a docs fixes) into pratik/otel-phase2-rpc-tracing 2026-05-14 16:11:04 +01:00
Pratik Mankawde
009c63e7db merge: pratik/otel-phase1b-telemetry-infra (phase-1a docs fixes) into pratik/otel-phase1c-rpc-integration 2026-05-14 16:11:01 +01:00
Pratik Mankawde
5d70a5fffd merge: pratik/otel-phase1a-plan-docs (phase-1a docs fixes) into pratik/otel-phase1b-telemetry-infra 2026-05-14 16:10:59 +01:00
Pratik Mankawde
f3a095ab65 docs(telemetry): align Phase 1a plan docs with Phase 1b implementation
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.

Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
  OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
  diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
  "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
  Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
  branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
  and the :4317 default from the sample config block and the options-summary
  table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
  into a new "Planned (not yet implemented)" table citing the phase that will
  add each one. Keep the example config minimal so copy-paste does not produce
  a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
  [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
  Complete" to "Phase 4a Plan" and describe the work as future. Replace the
  broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
  with a sentence pointing readers to where that spec will land. Renumber
  the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
  ("Effort Summary") was intentionally removed in earlier edits.
2026-05-14 16:09:48 +01:00
Pratik Mankawde
cb7dc5c52e merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing 2026-05-14 15:54:55 +01:00
Pratik Mankawde
9cfb43d8d0 merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration 2026-05-14 15:54:53 +01:00
Pratik Mankawde
7ada57e2a8 fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span()
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.

Add categoryToSpanKind() mapping:
  - Rpc           -> kServer   (inbound synchronous request)
  - Peer          -> kConsumer (inbound async peer message)
  - Transactions  -> kInternal
  - Consensus     -> kInternal
  - Ledger        -> kInternal

Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
2026-05-14 15:53:59 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
2430032e3a docs(telemetry): update Phase2 task list + design docs for attr rename
- Phase2_taskList: update attr refs to bare names, note node-health
  attrs moved to resource level.
- 02-design-decisions: strip xrpl.pathfind.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:29:20 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00
Pratik Mankawde
497dd007d9 refactor(telemetry): simplify attr naming on phase-2 — drop xrpl.pathfind. prefix
- Drop xrpl.pathfind.* prefix from per-span attrs (source_account,
  dest_account, fast, search_level, num_complete_paths, num_paths,
  num_requests).
- Keep xrpl.pathfind.ledger_index qualified (rule 5: distinct from
  xrpl.ledger.seq).
- Remove per-span nodeAmendmentBlocked/nodeServerState calls from
  RPCHandler — promoted to resource-level attrs.
- Mark node-health attrs in SpanNames.h as RESOURCE-ONLY with doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:57:36 +01:00
Pratik Mankawde
0d845149ec Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 15:55:39 +01:00
Pratik Mankawde
7a854ccad2 refactor(telemetry): simplify attr naming on phase-1c — drop xrpl.<domain>. prefix
- Drop xrpl.rpc.* prefix from per-span attrs (command, version).
- Qualify collision-prone fields: role -> rpc_role/grpc_role,
  status -> rpc_status/grpc_status.
- Rename payload_size -> request_payload_size for cross-domain clarity.
- Simplify link.type -> link_type (bare name, no join).
- Update convention doc in SpanNames.h to reflect new naming rules.
- Update telemetry.md doc with renamed attr keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:13 +01:00
Pratik Mankawde
87a8780b73 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
# Conflicts:
#	src/xrpld/rpc/detail/RPCHandler.cpp
2026-04-29 18:18:37 +01:00
Pratik Mankawde
78522ba18e Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 18:17:02 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
e21e7b0d51 fix(telemetry): add missing PublicKey.h include for toBase58 in Application.cpp
Clang-tidy misc-include-cleaner requires direct includes for all used
symbols. Application.cpp calls toBase58(TokenType::NodePublic, ...) at
line 1359 but did not directly include PublicKey.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:51:35 +01:00
Pratik Mankawde
3ed22580fe fix(telemetry): address remaining clang-tidy and cspell CI failures
- Add "hicpp" to cspell dictionary for NOLINT annotations
- Concatenate nested namespaces in RpcSpanNames.h
- Fix include hygiene and nested ternary in RPCHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:31:58 +01:00
Pratik Mankawde
20fabbc0ec fix(telemetry): resolve Clang build, clang-tidy, and rename CI failures
- Add [[maybe_unused]] to RAII span variables in PathFind/RipplePathFind
  handlers (Clang -Wunused-variable with -Werror)
- Restore over-renamed values: rippledb, rippled.cfg, historical GitHub URL
- Concatenate nested namespaces in SpanNames.h and PathFindSpanNames.h
  (modernize-concat-nested-namespaces)
- Add missing includes and const qualifiers in test files
- Suppress intentional use-after-move in SpanGuardFactory move test
- Remove unused NetworkOPs.h include from PathRequest.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:02:55 +01:00
Pratik Mankawde
831be14fd9 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-29 13:01:14 +01:00
Pratik Mankawde
019e84f0d2 Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 13:01:08 +01:00
Pratik Mankawde
694abe2004 docs(telemetry): add thread-safety comments to stop() and sdkProvider_.reset()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:00:39 +01:00
Pratik Mankawde
e07391fe78 chore: apply rename script (rippled → xrpld)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:25:03 +01:00
Pratik Mankawde
9515177e29 docs(telemetry): add missing config options to xrpld-example.cfg
Document service_instance_id, use_tls, tls_ca_cert, batch_size,
batch_delay_ms, and max_queue_size in the [telemetry] section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:20:18 +01:00
Pratik Mankawde
bd6e58a20e fix(telemetry): add missing span constants, fix test API, update levelization
Add unknownCommand and wsUpgrade span name constants to RpcSpanNames.h,
fix SpanGuardFactory tests to use the 3-argument SpanGuard::span() API,
update levelization results, and apply rename script to docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:21:13 +01:00
Pratik Mankawde
e8c826c816 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-29 11:17:19 +01:00
Pratik Mankawde
d753191d20 Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 11:16:51 +01:00
Pratik Mankawde
d4e91b462e fix(telemetry): resolve clang-tidy warnings in Telemetry interfaces
Use C++17 concatenated namespaces, add [[nodiscard]] to query methods,
add missing direct includes, and use pass-by-value + std::move in
NullTelemetry constructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:16:21 +01:00
Pratik Mankawde
fa2736277f Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-04-28 15:07:17 +01:00
Pratik Mankawde
196c309d1d Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
# Conflicts:
#	src/libxrpl/telemetry/Telemetry.cpp
2026-04-28 15:07:07 +01:00
Pratik Mankawde
d46d015fd5 fix(telemetry): fix include ordering in PathFind span files
Sort PathFindSpanNames.h after AssetCache.h alphabetically in
PathRequestManager.cpp and Pathfinder.cpp to satisfy the project's
include-order convention enforced by pre-commit hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:05:57 +01:00
Pratik Mankawde
999bf83f15 fix(telemetry): fix SpanGuard.cpp include ordering
Move SpanGuard.h (associated header) to first include position,
separated by blank line from other project includes, per the
project's include-order convention enforced by pre-commit hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:05:03 +01:00
Pratik Mankawde
96470e0c8d fix(telemetry): fix include ordering and markdown table formatting
Move Telemetry.h (associated header) to first include position in
Telemetry.cpp per the project's include-order convention. Trim
trailing whitespace from POC_taskList.md markdown table columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:04:09 +01:00
Pratik Mankawde
ed8164d502 docs(telemetry): add Task 2.9 PathFind instrumentation to Phase 2 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
682d7a8d76 feat(telemetry): add PathFind tracing with 5 spans (Tasks 2.9/2.10)
Instrument the path finding subsystem with full span coverage:

- pathfind.request: wraps doPathFind() and doRipplePathFind() RPC handlers
- pathfind.compute: wraps PathRequest::doUpdate() with fast/normal attr
- pathfind.update_all: wraps PathRequestManager::updateAll() on ledger
  close with ledger_index attr
- pathfind.discover: wraps Pathfinder::findPaths() graph exploration
  with search_level attr
- pathfind.rank: wraps Pathfinder::computePathRanks() liquidity
  validation with num_paths attr

New file: PathFindSpanNames.h with compile-time constants following
the StaticStr/join() pattern from Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
eb51457e69 fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
65817c4c57 fix(telemetry): align TelemetryConfig tests with current API
- serviceName default is "xrpld" not "rippled"
- Remove references to nonexistent exporterType field
- Pass networkId (4th param) to setup_Telemetry()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
9bc8cc6b4e docs(telemetry): update Phase2 task list to reflect actual implementation
Mark deferred tasks (2.1→Phase 3, 2.5→low priority) with rationale.
Mark superseded tasks (2.2→Phase 1c SpanGuard factory). Add Task 2.7
for Grafana search filters. Update summary table with status column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
832648c351 feat(telemetry): add RPC trace filters and SpanGuard unit tests
- Grafana Tempo datasource: add rpc-command, rpc-status, rpc-role
  search filters for the Explore UI
- Unit tests: TelemetryConfig (config parsing defaults and sections),
  SpanGuardFactory (null guard safety, move semantics, discard, all
  factory methods)
- Test CMake registration with optional OTel linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
21b58a8885 feat(telemetry): add node health attributes to RPC spans (Task 2.8)
Add amendment_blocked and server_state span attributes to every
rpc.command.* span so operators can correlate RPC behavior with node state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
a9ee819ea1 docs(telemetry): add Phase 2-5 task lists and appendix update
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
736579e473 refactor(telemetry): extract span name constants into modular headers
Centralise scattered string literals into compile-time constants using
StaticStr<N> and join() for dot-separated composition. Shared primitives
live in SpanNames.h; RPC-specific names in RpcSpanNames.h. Future modules
(consensus, peer, ledger) add their own *SpanNames.h without bloating
the central header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00