diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index 6523a53d09..7ced4aaf2b 100644 --- a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -346,20 +346,20 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = The following table summarizes what data is collected by category: -| Category | Attributes Collected | Purpose | -| --------------- | ---------------------------------------------------------------------- | ---------------------------- | -| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | -| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing | -| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | -| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | -| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | -| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | -| **PathFinding** | `pathfind.source_currency`, `dest_currency`, `path_count`, `cache_hit` | Payment path analysis | -| **TxQ** | `txq.queue_depth`, `fee_level`, `eviction_reason` | Queue depth and fee tracking | -| **Fee** | `fee.load_factor`, `escalation_level` | Fee escalation monitoring | -| **Validator** | `validator.list_size`, `list_age_sec` | UNL health monitoring | -| **Amendment** | `amendment.name`, `status` | Protocol upgrade tracking | -| **SHAMap** | `shamap.type`, `missing_nodes`, `duration_ms` | State tree sync performance | +| Category | Attributes Collected | Purpose | +| --------------- | ---------------------------------------------------------------------------------------------------------------- | ---------------------------- | +| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | +| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing | +| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | +| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | +| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | +| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | +| **PathFinding** | `pathfind_fast`, `pathfind_search_level`, `pathfind_num_paths`, `pathfind_ledger_index`, `pathfind_num_requests` | Payment path analysis | +| **TxQ** | `txq.queue_depth`, `fee_level`, `eviction_reason` | Queue depth and fee tracking | +| **Fee** | `fee.load_factor`, `escalation_level` | Fee escalation monitoring | +| **Validator** | `validator.list_size`, `list_age_sec` | UNL health monitoring | +| **Amendment** | `amendment.name`, `status` | Protocol upgrade tracking | +| **SHAMap** | `shamap.type`, `missing_nodes`, `duration_ms` | State tree sync performance | ### 2.4.4 Privacy & Sensitive Data Policy diff --git a/OpenTelemetryPlan/Phase2_taskList.md b/OpenTelemetryPlan/Phase2_taskList.md index 6979f78869..6b64210e6b 100644 --- a/OpenTelemetryPlan/Phase2_taskList.md +++ b/OpenTelemetryPlan/Phase2_taskList.md @@ -1,8 +1,8 @@ # Phase 2: RPC Tracing Completion Task List -> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, node health attributes, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability. +> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, PathFind instrumentation, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability. > -> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, node health span attributes, config validation (`std::clamp`). +> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, PathFind RPC tracing, config validation (`std::clamp`). > > **Branch**: `pratik/otel-phase2-rpc-tracing` (from `pratik/otel-phase1c-rpc-integration`) @@ -121,42 +121,9 @@ These can be added later if dashboard queries specifically need them. The node h ## Task 2.8: RPC Span Attribute Enrichment — Node Health Context -> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — adds node-level health context inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). -> -> **Downstream**: Phase 7 (MetricsRegistry uses these attributes for alerting context), Phase 10 (validation checks for these attributes). +**Status**: DROPPED. -**Objective**: Add node-level health state to every `rpc.command.*` span so operators can correlate RPC behavior with node state in Tempo. - -**What to do**: - -- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`: - - In the `rpc.command.*` span creation block (after existing `setAttribute` calls for `command`, `version`, etc.): - - Node health attrs (`xrpl.node.amendment_blocked`, `xrpl.node.server_state`) are now resource-level attrs, not per-span. They are set at Tracer init. - -**New span attributes**: - -| Attribute | Type | Source | Example | -| ----------------------------- | ------ | ------------------------------------------- | -------- | -| `xrpl.node.amendment_blocked` | bool | `context.app.getOPs().isAmendmentBlocked()` | `true` | -| `xrpl.node.server_state` | string | `context.app.getOPs().strOperatingMode()` | `"full"` | - -**Rationale**: When a node is amendment-blocked or in a degraded state, every RPC response is suspect. Tagging spans with this state enables Tempo TraceQL queries like: - -``` -{name=~"rpc.command.*"} | xrpl.node.amendment_blocked = true -``` - -This surfaces all RPCs served during a blocked period — critical for post-incident analysis. - -**Key modified files**: - -- `src/xrpld/rpc/detail/RPCHandler.cpp` - -**Exit Criteria**: - -- [ ] `rpc.command.server_info` spans carry `xrpl.node.amendment_blocked` and `xrpl.node.server_state` attributes -- [ ] No measurable latency impact (attribute values are cached atomics, not computed per-call) -- [ ] Attributes appear in Tempo trace detail view +Node health (`amendment_blocked`, `server_state`) is not part of the telemetry surface. Operators consume the same data via the existing `server_info` / `server_state` RPC commands, so duplicating it on traces adds storage and cardinality cost without new value. The OTel C++ SDK 1.18.0 also does not support runtime updates to the resource, ruling out resource-level emission of these dynamic-by-nature flags. --- @@ -169,10 +136,11 @@ This surfaces all RPCs served during a blocked period — critical for post-inci **Spans added**: - `pathfind.request` — wraps `doPathFind()` and `doRipplePathFind()` RPC handlers -- `pathfind.compute` — wraps `PathRequest::doUpdate()` (fast/normal attr) -- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (ledger_index attr) -- `pathfind.discover` — wraps `Pathfinder::findPaths()` graph exploration (search_level attr) -- `pathfind.rank` — wraps `Pathfinder::computePathRanks()` liquidity validation (num_paths attr) +- `pathfind.compute` — wraps `PathRequest::doUpdate()` (`pathfind_fast` attr) +- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (`pathfind_ledger_index`, `pathfind_num_requests` attrs; emitted only when active subscriptions exist) +- `pathfind.discover` — wraps the entire per-source-asset loop in `PathRequest::findPaths()` (`pathfind_search_level`, `pathfind_num_paths` attrs). One span per RPC call instead of N (one per source asset). Trade-off: per-asset breakdown is lost; storage and cardinality bounded. + +**Attribute namespacing**: All pathfind attributes use the `pathfind_*` underscore form per the Phase 1c naming-spec rule 5. **New file**: `src/xrpld/rpc/detail/PathFindSpanNames.h` @@ -197,9 +165,10 @@ This surfaces all RPCs served during a blocked period — critical for post-inci | 2.5 | Enhanced RPC span attributes (HTTP-level) | Deferred | Low value; span duration covers timing natively | | 2.6 | Build verification and performance baseline | Complete | Verified in CI on Phase 1c | | 2.7 | Grafana Tempo search filters | Complete | rpc-command, rpc-status, rpc-role filters | -| 2.8 | RPC span attribute enrichment (node health) | Complete | amendment_blocked + server_state | -| 2.9 | PathFind RPC instrumentation (5 spans) | Complete | request, compute, update_all, discover, rank | +| 2.8 | RPC span attribute enrichment (node health) | Dropped | Available via `server_info`/`server_state` RPC | +| 2.9 | PathFind RPC instrumentation | Complete | request, compute, update_all, discover | -**Delivered in this branch**: Tasks 2.4, 2.7, 2.8, 2.9. +**Delivered in this branch**: Tasks 2.4, 2.7, 2.9. **Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority). +**Dropped**: Task 2.8 (node health not duplicated on traces). **Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this). diff --git a/docker/telemetry/grafana/provisioning/datasources/tempo.yaml b/docker/telemetry/grafana/provisioning/datasources/tempo.yaml index 7f4bd3684f..1af993961a 100644 --- a/docker/telemetry/grafana/provisioning/datasources/tempo.yaml +++ b/docker/telemetry/grafana/provisioning/datasources/tempo.yaml @@ -106,14 +106,3 @@ datasources: operator: "=" scope: span type: dynamic - # Phase 2: Node health filters (Task 2.8) — resource attributes - - id: node-amendment-blocked - tag: xrpl.node.amendment_blocked - operator: "=" - scope: resource - type: static - - id: node-server-state - tag: xrpl.node.server_state - operator: "=" - scope: resource - type: dynamic diff --git a/include/xrpl/telemetry/SpanNames.h b/include/xrpl/telemetry/SpanNames.h index 8ea05fe2e0..fe12e3bdc5 100644 --- a/include/xrpl/telemetry/SpanNames.h +++ b/include/xrpl/telemetry/SpanNames.h @@ -108,15 +108,6 @@ namespace attr { inline constexpr auto networkId = join(join(seg::xrpl, seg::network), makeStr("id")); inline constexpr auto networkType = join(join(seg::xrpl, seg::network), makeStr("type")); inline constexpr auto linkType = makeStr("link_type"); - -/// Node health attributes — RESOURCE-ONLY (process identity, not per-span). -/// Set at Tracer init via resource::Resource::Create and refreshed on state -/// transitions. Do NOT use with span.setAttribute(). -inline constexpr auto xrplNode = join(seg::xrpl, makeStr("node")); -/// "xrpl.node.amendment_blocked" — resource attribute key. -inline constexpr auto nodeAmendmentBlocked = join(xrplNode, makeStr("amendment_blocked")); -/// "xrpl.node.server_state" — resource attribute key. -inline constexpr auto nodeServerState = join(xrplNode, makeStr("server_state")); } // namespace attr // ===== Shared attribute values ============================================= diff --git a/src/xrpld/rpc/detail/PathFindSpanNames.h b/src/xrpld/rpc/detail/PathFindSpanNames.h index af61340b6f..17a1c12858 100644 --- a/src/xrpld/rpc/detail/PathFindSpanNames.h +++ b/src/xrpld/rpc/detail/PathFindSpanNames.h @@ -9,34 +9,36 @@ * * RPC entry (one-shot or subscription): * - * +-------------------------------------------------------+ - * | pathfind.request | - * | doPathFind() / doRipplePathFind() | - * | attrs: source_account, dest_account | - * | | - * | +--------------------------------------------------+ | - * | | pathfind.compute | | - * | | PathRequest::doUpdate() | | - * | | attrs: fast, search_level | | - * | | | | - * | | +---------------------+ +--------------------+ | | - * | | | pathfind.discover | | pathfind.rank | | | - * | | | Pathfinder::find() | | computePathRanks() | | | - * | | +---------------------+ +--------------------+ | | - * | +--------------------------------------------------+ | - * +-------------------------------------------------------+ + * +----------------------------------------------------------------+ + * | pathfind.request | + * | doPathFind() / doRipplePathFind() | + * | attrs: pathfind_source_account, pathfind_dest_account | + * | (set when present in request params) | + * | | + * | +-----------------------------------------------------------+ | + * | | pathfind.compute | | + * | | PathRequest::doUpdate() | | + * | | attrs: pathfind_fast | | + * | | | | + * | | +-----------------------------------------------------+ | | + * | | | pathfind.discover (one per RPC call, hoisted above | | + * | | | the per-source-asset loop in PathRequest::findPaths)| | + * | | | attrs: pathfind_search_level, pathfind_num_paths | | + * | | +-----------------------------------------------------+ | | + * | +-----------------------------------------------------------+ | + * +----------------------------------------------------------------+ * * Async recomputation (ledger close): * - * +-------------------------------------------------------+ - * | pathfind.update_all | - * | PathRequestManager::updateAll() | - * | attrs: ledger_index, num_requests | - * | | - * | +--------------------------------------------------+ | - * | | pathfind.compute (per active request) | | - * | +--------------------------------------------------+ | - * +-------------------------------------------------------+ + * +----------------------------------------------------------------+ + * | pathfind.update_all | + * | PathRequestManager::updateAll() | + * | attrs: pathfind_ledger_index, pathfind_num_requests | + * | | + * | +-----------------------------------------------------------+ | + * | | pathfind.compute (per active request) | | + * | +-----------------------------------------------------------+ | + * +----------------------------------------------------------------+ */ #include @@ -57,30 +59,31 @@ inline constexpr auto request = makeStr("request"); inline constexpr auto compute = makeStr("compute"); inline constexpr auto updateAll = makeStr("update_all"); inline constexpr auto discover = makeStr("discover"); -inline constexpr auto rank = makeStr("rank"); } // namespace op // ===== Attribute keys ====================================================== +// +// All pathfind attributes are namespaced under `pathfind_*` (underscore form, +// per Phase 1c naming spec rule 5). Avoids collisions with bare keys like +// `fast` or `num_paths` that other subsystems may introduce. namespace attr { -/// "source_account" — originating account for path search. -inline constexpr auto sourceAccount = makeStr("source_account"); -/// "dest_account" — destination account. -inline constexpr auto destAccount = makeStr("dest_account"); -/// "fast" — whether fast pathfinding mode enabled. -inline constexpr auto fast = makeStr("fast"); -/// "search_level" — depth of graph exploration. -inline constexpr auto searchLevel = makeStr("search_level"); -/// "num_complete_paths" — complete paths found. -inline constexpr auto numCompletePaths = makeStr("num_complete_paths"); -/// "num_paths" — total paths returned. -inline constexpr auto numPaths = makeStr("num_paths"); -/// "num_requests" — active path requests. -inline constexpr auto numRequests = makeStr("num_requests"); -/// "xrpl.pathfind.ledger_index" — kept qualified (rule 5): pathfind target -/// ledger is distinct from xrpl.ledger.seq. -inline constexpr auto ledgerIndex = - join(join(seg::xrpl, makeStr("pathfind")), makeStr("ledger_index")); +/// "pathfind_source_account" — originating account for path search. +inline constexpr auto sourceAccount = makeStr("pathfind_source_account"); +/// "pathfind_dest_account" — destination account. +inline constexpr auto destAccount = makeStr("pathfind_dest_account"); +/// "pathfind_fast" — whether fast pathfinding mode enabled. +inline constexpr auto fast = makeStr("pathfind_fast"); +/// "pathfind_search_level" — depth of graph exploration. +inline constexpr auto searchLevel = makeStr("pathfind_search_level"); +/// "pathfind_num_paths" — total paths produced across the per-source-asset +/// loop in PathRequest::findPaths (sum of getBestPaths().size() per asset). +inline constexpr auto numPaths = makeStr("pathfind_num_paths"); +/// "pathfind_num_requests" — snapshot size of requests_ at update_all start +/// (may include weak_ptrs that subsequently expire during processing). +inline constexpr auto numRequests = makeStr("pathfind_num_requests"); +/// "pathfind_ledger_index" — pathfind target ledger index. +inline constexpr auto ledgerIndex = makeStr("pathfind_ledger_index"); } // namespace attr } // namespace xrpl::telemetry::pathfind_span diff --git a/src/xrpld/rpc/detail/PathRequest.cpp b/src/xrpld/rpc/detail/PathRequest.cpp index 8148879362..7b54872cbc 100644 --- a/src/xrpld/rpc/detail/PathRequest.cpp +++ b/src/xrpld/rpc/detail/PathRequest.cpp @@ -40,6 +40,7 @@ #include #include +#include #include #include #include @@ -579,6 +580,20 @@ PathRequest::findPaths( auto const dst_amount = convertAmount(saDstAmount, convert_all_); hash_map> pathasset_map; + + // One `pathfind.discover` span wraps the entire per-source-asset loop so + // that a single RPC call produces one discover span instead of N (one per + // candidate source asset). Trade-off: per-asset discovery/ranking timing + // is no longer split into individual spans — span count and Tempo storage + // are bounded per RPC at the cost of per-asset visibility. If per-asset + // breakdown is needed in the future, add child spans inside the loop body + // (`Pathfinder::findPaths`/`computePathRanks`) parented off this span. + using namespace telemetry; + auto span = SpanGuard::span( + TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::discover); + span.setAttribute(pathfind_span::attr::searchLevel, static_cast(level)); + + std::int64_t totalPaths = 0; for (auto const& asset : sourceAssets) { if (continueCallback && !continueCallback()) @@ -598,6 +613,7 @@ PathRequest::findPaths( auto ps = pathfinder->getBestPaths( max_paths_, fullLiquidityPath, mContext[asset], asset.getIssuer(), continueCallback); mContext[asset] = ps; + totalPaths += static_cast(ps.size()); auto const& sourceAccount = [&] { if (!isXRP(asset.getIssuer())) @@ -697,6 +713,8 @@ PathRequest::findPaths( } } + span.setAttribute(pathfind_span::attr::numPaths, totalPaths); + /* The resource fee is based on the number of source currencies used. The minimum cost is 50 and the maximum is 400. The cost increases after four source currencies, 50 - (4 * 4) = 34. diff --git a/src/xrpld/rpc/detail/PathRequestManager.cpp b/src/xrpld/rpc/detail/PathRequestManager.cpp index 16e5def7c3..9577e500de 100644 --- a/src/xrpld/rpc/detail/PathRequestManager.cpp +++ b/src/xrpld/rpc/detail/PathRequestManager.cpp @@ -61,11 +61,6 @@ PathRequestManager::getAssetCache(std::shared_ptr const& ledger, void PathRequestManager::updateAll(std::shared_ptr const& inLedger) { - using namespace telemetry; - auto span = SpanGuard::span( - TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::updateAll); - span.setAttribute(pathfind_span::attr::ledgerIndex, static_cast(inLedger->seq())); - auto event = app_.getJobQueue().makeLoadEvent(jtPATH_FIND, "PathRequest::updateAll"); std::vector requests; @@ -78,6 +73,18 @@ PathRequestManager::updateAll(std::shared_ptr const& inLedger) cache = getAssetCache(inLedger, true); } + // updateAll runs on every ledger close; skip span emission entirely when + // there are no active path subscriptions to avoid a steady stream of empty + // spans at mainnet close cadence. + if (requests.empty()) + return; + + using namespace telemetry; + auto span = SpanGuard::span( + TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::updateAll); + span.setAttribute(pathfind_span::attr::ledgerIndex, static_cast(inLedger->seq())); + span.setAttribute(pathfind_span::attr::numRequests, static_cast(requests.size())); + bool newRequests = app_.getLedgerMaster().isNewPathRequest(); bool mustBreak = false; diff --git a/src/xrpld/rpc/detail/Pathfinder.cpp b/src/xrpld/rpc/detail/Pathfinder.cpp index 1f1034739e..a31d4522a7 100644 --- a/src/xrpld/rpc/detail/Pathfinder.cpp +++ b/src/xrpld/rpc/detail/Pathfinder.cpp @@ -2,7 +2,6 @@ #include #include -#include #include #include #include @@ -30,7 +29,6 @@ #include #include #include -#include #include #include @@ -229,11 +227,6 @@ Pathfinder::Pathfinder( bool Pathfinder::findPaths(int searchLevel, std::function const& continueCallback) { - using namespace telemetry; - auto span = SpanGuard::span( - TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::discover); - span.setAttribute(pathfind_span::attr::searchLevel, static_cast(searchLevel)); - JLOG(j_.trace()) << "findPaths start"; if (mDstAmount == beast::zero) { @@ -444,11 +437,6 @@ Pathfinder::getPathLiquidity( void Pathfinder::computePathRanks(int maxPaths, std::function const& continueCallback) { - using namespace telemetry; - auto span = SpanGuard::span( - TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::rank); - span.setAttribute(pathfind_span::attr::numPaths, static_cast(maxPaths)); - mRemainingAmount = convertAmount(mDstAmount, convert_all_); // Must subtract liquidity in default path from remaining amount. diff --git a/src/xrpld/rpc/handlers/orderbook/PathFind.cpp b/src/xrpld/rpc/handlers/orderbook/PathFind.cpp index 9aa84ac225..b40c74eedc 100644 --- a/src/xrpld/rpc/handlers/orderbook/PathFind.cpp +++ b/src/xrpld/rpc/handlers/orderbook/PathFind.cpp @@ -18,8 +18,13 @@ Json::Value doPathFind(RPC::JsonContext& context) { using namespace telemetry; - [[maybe_unused]] auto span = SpanGuard::span( + auto span = SpanGuard::span( TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::request); + if (auto const& src = context.params[jss::source_account]; src.isString()) + span.setAttribute(pathfind_span::attr::sourceAccount, src.asString()); + if (auto const& dst = context.params[jss::destination_account]; dst.isString()) + span.setAttribute(pathfind_span::attr::destAccount, dst.asString()); + if (context.app.config().PATH_SEARCH_MAX == 0) return rpcError(rpcNOT_SUPPORTED); diff --git a/src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp b/src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp index ff4ed3eb39..0bc751fcb1 100644 --- a/src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp +++ b/src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp @@ -26,8 +26,13 @@ Json::Value doRipplePathFind(RPC::JsonContext& context) { using namespace telemetry; - [[maybe_unused]] auto span = SpanGuard::span( + auto span = SpanGuard::span( TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::request); + if (auto const& src = context.params[jss::source_account]; src.isString()) + span.setAttribute(pathfind_span::attr::sourceAccount, src.asString()); + if (auto const& dst = context.params[jss::destination_account]; dst.isString()) + span.setAttribute(pathfind_span::attr::destAccount, dst.asString()); + if (context.app.config().PATH_SEARCH_MAX == 0) return rpcError(rpcNOT_SUPPORTED);