fix(telemetry): address PR #6424 review comments

- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-05-28 11:27:29 +01:00
parent 64ffcffe32
commit 9498b2865f
10 changed files with 116 additions and 141 deletions

View File

@@ -346,20 +346,20 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
The following table summarizes what data is collected by category:
| Category | Attributes Collected | Purpose |
| --------------- | ---------------------------------------------------------------------- | ---------------------------- |
| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle |
| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing |
| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance |
| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis |
| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking |
| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance |
| **PathFinding** | `pathfind.source_currency`, `dest_currency`, `path_count`, `cache_hit` | Payment path analysis |
| **TxQ** | `txq.queue_depth`, `fee_level`, `eviction_reason` | Queue depth and fee tracking |
| **Fee** | `fee.load_factor`, `escalation_level` | Fee escalation monitoring |
| **Validator** | `validator.list_size`, `list_age_sec` | UNL health monitoring |
| **Amendment** | `amendment.name`, `status` | Protocol upgrade tracking |
| **SHAMap** | `shamap.type`, `missing_nodes`, `duration_ms` | State tree sync performance |
| Category | Attributes Collected | Purpose |
| --------------- | ---------------------------------------------------------------------------------------------------------------- | ---------------------------- |
| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle |
| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing |
| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance |
| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis |
| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking |
| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance |
| **PathFinding** | `pathfind_fast`, `pathfind_search_level`, `pathfind_num_paths`, `pathfind_ledger_index`, `pathfind_num_requests` | Payment path analysis |
| **TxQ** | `txq.queue_depth`, `fee_level`, `eviction_reason` | Queue depth and fee tracking |
| **Fee** | `fee.load_factor`, `escalation_level` | Fee escalation monitoring |
| **Validator** | `validator.list_size`, `list_age_sec` | UNL health monitoring |
| **Amendment** | `amendment.name`, `status` | Protocol upgrade tracking |
| **SHAMap** | `shamap.type`, `missing_nodes`, `duration_ms` | State tree sync performance |
### 2.4.4 Privacy & Sensitive Data Policy

View File

@@ -1,8 +1,8 @@
# Phase 2: RPC Tracing Completion Task List
> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, node health attributes, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability.
> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, PathFind instrumentation, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability.
>
> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, node health span attributes, config validation (`std::clamp`).
> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, PathFind RPC tracing, config validation (`std::clamp`).
>
> **Branch**: `pratik/otel-phase2-rpc-tracing` (from `pratik/otel-phase1c-rpc-integration`)
@@ -121,42 +121,9 @@ These can be added later if dashboard queries specifically need them. The node h
## Task 2.8: RPC Span Attribute Enrichment — Node Health Context
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — adds node-level health context inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
>
> **Downstream**: Phase 7 (MetricsRegistry uses these attributes for alerting context), Phase 10 (validation checks for these attributes).
**Status**: DROPPED.
**Objective**: Add node-level health state to every `rpc.command.*` span so operators can correlate RPC behavior with node state in Tempo.
**What to do**:
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
- In the `rpc.command.*` span creation block (after existing `setAttribute` calls for `command`, `version`, etc.):
- Node health attrs (`xrpl.node.amendment_blocked`, `xrpl.node.server_state`) are now resource-level attrs, not per-span. They are set at Tracer init.
**New span attributes**:
| Attribute | Type | Source | Example |
| ----------------------------- | ------ | ------------------------------------------- | -------- |
| `xrpl.node.amendment_blocked` | bool | `context.app.getOPs().isAmendmentBlocked()` | `true` |
| `xrpl.node.server_state` | string | `context.app.getOPs().strOperatingMode()` | `"full"` |
**Rationale**: When a node is amendment-blocked or in a degraded state, every RPC response is suspect. Tagging spans with this state enables Tempo TraceQL queries like:
```
{name=~"rpc.command.*"} | xrpl.node.amendment_blocked = true
```
This surfaces all RPCs served during a blocked period — critical for post-incident analysis.
**Key modified files**:
- `src/xrpld/rpc/detail/RPCHandler.cpp`
**Exit Criteria**:
- [ ] `rpc.command.server_info` spans carry `xrpl.node.amendment_blocked` and `xrpl.node.server_state` attributes
- [ ] No measurable latency impact (attribute values are cached atomics, not computed per-call)
- [ ] Attributes appear in Tempo trace detail view
Node health (`amendment_blocked`, `server_state`) is not part of the telemetry surface. Operators consume the same data via the existing `server_info` / `server_state` RPC commands, so duplicating it on traces adds storage and cardinality cost without new value. The OTel C++ SDK 1.18.0 also does not support runtime updates to the resource, ruling out resource-level emission of these dynamic-by-nature flags.
---
@@ -169,10 +136,11 @@ This surfaces all RPCs served during a blocked period — critical for post-inci
**Spans added**:
- `pathfind.request` — wraps `doPathFind()` and `doRipplePathFind()` RPC handlers
- `pathfind.compute` — wraps `PathRequest::doUpdate()` (fast/normal attr)
- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (ledger_index attr)
- `pathfind.discover` — wraps `Pathfinder::findPaths()` graph exploration (search_level attr)
- `pathfind.rank` — wraps `Pathfinder::computePathRanks()` liquidity validation (num_paths attr)
- `pathfind.compute` — wraps `PathRequest::doUpdate()` (`pathfind_fast` attr)
- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (`pathfind_ledger_index`, `pathfind_num_requests` attrs; emitted only when active subscriptions exist)
- `pathfind.discover` — wraps the entire per-source-asset loop in `PathRequest::findPaths()` (`pathfind_search_level`, `pathfind_num_paths` attrs). One span per RPC call instead of N (one per source asset). Trade-off: per-asset breakdown is lost; storage and cardinality bounded.
**Attribute namespacing**: All pathfind attributes use the `pathfind_*` underscore form per the Phase 1c naming-spec rule 5.
**New file**: `src/xrpld/rpc/detail/PathFindSpanNames.h`
@@ -197,9 +165,10 @@ This surfaces all RPCs served during a blocked period — critical for post-inci
| 2.5 | Enhanced RPC span attributes (HTTP-level) | Deferred | Low value; span duration covers timing natively |
| 2.6 | Build verification and performance baseline | Complete | Verified in CI on Phase 1c |
| 2.7 | Grafana Tempo search filters | Complete | rpc-command, rpc-status, rpc-role filters |
| 2.8 | RPC span attribute enrichment (node health) | Complete | amendment_blocked + server_state |
| 2.9 | PathFind RPC instrumentation (5 spans) | Complete | request, compute, update_all, discover, rank |
| 2.8 | RPC span attribute enrichment (node health) | Dropped | Available via `server_info`/`server_state` RPC |
| 2.9 | PathFind RPC instrumentation | Complete | request, compute, update_all, discover |
**Delivered in this branch**: Tasks 2.4, 2.7, 2.8, 2.9.
**Delivered in this branch**: Tasks 2.4, 2.7, 2.9.
**Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority).
**Dropped**: Task 2.8 (node health not duplicated on traces).
**Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this).

View File

@@ -106,14 +106,3 @@ datasources:
operator: "="
scope: span
type: dynamic
# Phase 2: Node health filters (Task 2.8) — resource attributes
- id: node-amendment-blocked
tag: xrpl.node.amendment_blocked
operator: "="
scope: resource
type: static
- id: node-server-state
tag: xrpl.node.server_state
operator: "="
scope: resource
type: dynamic

View File

@@ -108,15 +108,6 @@ namespace attr {
inline constexpr auto networkId = join(join(seg::xrpl, seg::network), makeStr("id"));
inline constexpr auto networkType = join(join(seg::xrpl, seg::network), makeStr("type"));
inline constexpr auto linkType = makeStr("link_type");
/// Node health attributes — RESOURCE-ONLY (process identity, not per-span).
/// Set at Tracer init via resource::Resource::Create and refreshed on state
/// transitions. Do NOT use with span.setAttribute().
inline constexpr auto xrplNode = join(seg::xrpl, makeStr("node"));
/// "xrpl.node.amendment_blocked" — resource attribute key.
inline constexpr auto nodeAmendmentBlocked = join(xrplNode, makeStr("amendment_blocked"));
/// "xrpl.node.server_state" — resource attribute key.
inline constexpr auto nodeServerState = join(xrplNode, makeStr("server_state"));
} // namespace attr
// ===== Shared attribute values =============================================

View File

@@ -9,34 +9,36 @@
*
* RPC entry (one-shot or subscription):
*
* +-------------------------------------------------------+
* | pathfind.request |
* | doPathFind() / doRipplePathFind() |
* | attrs: source_account, dest_account |
* | |
* | +--------------------------------------------------+ |
* | | pathfind.compute | |
* | | PathRequest::doUpdate() | |
* | | attrs: fast, search_level | |
* | | | |
* | | +---------------------+ +--------------------+ | |
* | | | pathfind.discover | | pathfind.rank | | |
* | | | Pathfinder::find() | | computePathRanks() | | |
* | | +---------------------+ +--------------------+ | |
* | +--------------------------------------------------+ |
* +-------------------------------------------------------+
* +----------------------------------------------------------------+
* | pathfind.request |
* | doPathFind() / doRipplePathFind() |
* | attrs: pathfind_source_account, pathfind_dest_account |
* | (set when present in request params) |
* | |
* | +-----------------------------------------------------------+ |
* | | pathfind.compute | |
* | | PathRequest::doUpdate() | |
* | | attrs: pathfind_fast | |
* | | | |
* | | +-----------------------------------------------------+ | |
* | | | pathfind.discover (one per RPC call, hoisted above | |
* | | | the per-source-asset loop in PathRequest::findPaths)| |
* | | | attrs: pathfind_search_level, pathfind_num_paths | |
* | | +-----------------------------------------------------+ | |
* | +-----------------------------------------------------------+ |
* +----------------------------------------------------------------+
*
* Async recomputation (ledger close):
*
* +-------------------------------------------------------+
* | pathfind.update_all |
* | PathRequestManager::updateAll() |
* | attrs: ledger_index, num_requests |
* | |
* | +--------------------------------------------------+ |
* | | pathfind.compute (per active request) | |
* | +--------------------------------------------------+ |
* +-------------------------------------------------------+
* +----------------------------------------------------------------+
* | pathfind.update_all |
* | PathRequestManager::updateAll() |
* | attrs: pathfind_ledger_index, pathfind_num_requests |
* | |
* | +-----------------------------------------------------------+ |
* | | pathfind.compute (per active request) | |
* | +-----------------------------------------------------------+ |
* +----------------------------------------------------------------+
*/
#include <xrpl/telemetry/SpanNames.h>
@@ -57,30 +59,31 @@ inline constexpr auto request = makeStr("request");
inline constexpr auto compute = makeStr("compute");
inline constexpr auto updateAll = makeStr("update_all");
inline constexpr auto discover = makeStr("discover");
inline constexpr auto rank = makeStr("rank");
} // namespace op
// ===== Attribute keys ======================================================
//
// All pathfind attributes are namespaced under `pathfind_*` (underscore form,
// per Phase 1c naming spec rule 5). Avoids collisions with bare keys like
// `fast` or `num_paths` that other subsystems may introduce.
namespace attr {
/// "source_account" — originating account for path search.
inline constexpr auto sourceAccount = makeStr("source_account");
/// "dest_account" — destination account.
inline constexpr auto destAccount = makeStr("dest_account");
/// "fast" — whether fast pathfinding mode enabled.
inline constexpr auto fast = makeStr("fast");
/// "search_level" — depth of graph exploration.
inline constexpr auto searchLevel = makeStr("search_level");
/// "num_complete_paths" — complete paths found.
inline constexpr auto numCompletePaths = makeStr("num_complete_paths");
/// "num_paths" — total paths returned.
inline constexpr auto numPaths = makeStr("num_paths");
/// "num_requests" — active path requests.
inline constexpr auto numRequests = makeStr("num_requests");
/// "xrpl.pathfind.ledger_index" — kept qualified (rule 5): pathfind target
/// ledger is distinct from xrpl.ledger.seq.
inline constexpr auto ledgerIndex =
join(join(seg::xrpl, makeStr("pathfind")), makeStr("ledger_index"));
/// "pathfind_source_account" — originating account for path search.
inline constexpr auto sourceAccount = makeStr("pathfind_source_account");
/// "pathfind_dest_account" — destination account.
inline constexpr auto destAccount = makeStr("pathfind_dest_account");
/// "pathfind_fast" — whether fast pathfinding mode enabled.
inline constexpr auto fast = makeStr("pathfind_fast");
/// "pathfind_search_level" — depth of graph exploration.
inline constexpr auto searchLevel = makeStr("pathfind_search_level");
/// "pathfind_num_paths" — total paths produced across the per-source-asset
/// loop in PathRequest::findPaths (sum of getBestPaths().size() per asset).
inline constexpr auto numPaths = makeStr("pathfind_num_paths");
/// "pathfind_num_requests" — snapshot size of requests_ at update_all start
/// (may include weak_ptrs that subsequently expire during processing).
inline constexpr auto numRequests = makeStr("pathfind_num_requests");
/// "pathfind_ledger_index" — pathfind target ledger index.
inline constexpr auto ledgerIndex = makeStr("pathfind_ledger_index");
} // namespace attr
} // namespace xrpl::telemetry::pathfind_span

View File

@@ -40,6 +40,7 @@
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
@@ -579,6 +580,20 @@ PathRequest::findPaths(
auto const dst_amount = convertAmount(saDstAmount, convert_all_);
hash_map<PathAsset, std::unique_ptr<Pathfinder>> pathasset_map;
// One `pathfind.discover` span wraps the entire per-source-asset loop so
// that a single RPC call produces one discover span instead of N (one per
// candidate source asset). Trade-off: per-asset discovery/ranking timing
// is no longer split into individual spans — span count and Tempo storage
// are bounded per RPC at the cost of per-asset visibility. If per-asset
// breakdown is needed in the future, add child spans inside the loop body
// (`Pathfinder::findPaths`/`computePathRanks`) parented off this span.
using namespace telemetry;
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::discover);
span.setAttribute(pathfind_span::attr::searchLevel, static_cast<int64_t>(level));
std::int64_t totalPaths = 0;
for (auto const& asset : sourceAssets)
{
if (continueCallback && !continueCallback())
@@ -598,6 +613,7 @@ PathRequest::findPaths(
auto ps = pathfinder->getBestPaths(
max_paths_, fullLiquidityPath, mContext[asset], asset.getIssuer(), continueCallback);
mContext[asset] = ps;
totalPaths += static_cast<std::int64_t>(ps.size());
auto const& sourceAccount = [&] {
if (!isXRP(asset.getIssuer()))
@@ -697,6 +713,8 @@ PathRequest::findPaths(
}
}
span.setAttribute(pathfind_span::attr::numPaths, totalPaths);
/* The resource fee is based on the number of source currencies used.
The minimum cost is 50 and the maximum is 400. The cost increases
after four source currencies, 50 - (4 * 4) = 34.

View File

@@ -61,11 +61,6 @@ PathRequestManager::getAssetCache(std::shared_ptr<ReadView const> const& ledger,
void
PathRequestManager::updateAll(std::shared_ptr<ReadView const> const& inLedger)
{
using namespace telemetry;
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::updateAll);
span.setAttribute(pathfind_span::attr::ledgerIndex, static_cast<int64_t>(inLedger->seq()));
auto event = app_.getJobQueue().makeLoadEvent(jtPATH_FIND, "PathRequest::updateAll");
std::vector<PathRequest::wptr> requests;
@@ -78,6 +73,18 @@ PathRequestManager::updateAll(std::shared_ptr<ReadView const> const& inLedger)
cache = getAssetCache(inLedger, true);
}
// updateAll runs on every ledger close; skip span emission entirely when
// there are no active path subscriptions to avoid a steady stream of empty
// spans at mainnet close cadence.
if (requests.empty())
return;
using namespace telemetry;
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::updateAll);
span.setAttribute(pathfind_span::attr::ledgerIndex, static_cast<int64_t>(inLedger->seq()));
span.setAttribute(pathfind_span::attr::numRequests, static_cast<int64_t>(requests.size()));
bool newRequests = app_.getLedgerMaster().isNewPathRequest();
bool mustBreak = false;

View File

@@ -2,7 +2,6 @@
#include <xrpld/app/main/Application.h>
#include <xrpld/rpc/detail/AssetCache.h>
#include <xrpld/rpc/detail/PathFindSpanNames.h>
#include <xrpld/rpc/detail/PathfinderUtils.h>
#include <xrpld/rpc/detail/RippleLineCache.h>
#include <xrpld/rpc/detail/TrustLine.h>
@@ -30,7 +29,6 @@
#include <xrpl/protocol/STPathSet.h>
#include <xrpl/protocol/TER.h>
#include <xrpl/protocol/UintTypes.h>
#include <xrpl/telemetry/SpanGuard.h>
#include <xrpl/tx/paths/RippleCalc.h>
#include <algorithm>
@@ -229,11 +227,6 @@ Pathfinder::Pathfinder(
bool
Pathfinder::findPaths(int searchLevel, std::function<bool(void)> const& continueCallback)
{
using namespace telemetry;
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::discover);
span.setAttribute(pathfind_span::attr::searchLevel, static_cast<int64_t>(searchLevel));
JLOG(j_.trace()) << "findPaths start";
if (mDstAmount == beast::zero)
{
@@ -444,11 +437,6 @@ Pathfinder::getPathLiquidity(
void
Pathfinder::computePathRanks(int maxPaths, std::function<bool(void)> const& continueCallback)
{
using namespace telemetry;
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::rank);
span.setAttribute(pathfind_span::attr::numPaths, static_cast<int64_t>(maxPaths));
mRemainingAmount = convertAmount(mDstAmount, convert_all_);
// Must subtract liquidity in default path from remaining amount.

View File

@@ -18,8 +18,13 @@ Json::Value
doPathFind(RPC::JsonContext& context)
{
using namespace telemetry;
[[maybe_unused]] auto span = SpanGuard::span(
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::request);
if (auto const& src = context.params[jss::source_account]; src.isString())
span.setAttribute(pathfind_span::attr::sourceAccount, src.asString());
if (auto const& dst = context.params[jss::destination_account]; dst.isString())
span.setAttribute(pathfind_span::attr::destAccount, dst.asString());
if (context.app.config().PATH_SEARCH_MAX == 0)
return rpcError(rpcNOT_SUPPORTED);

View File

@@ -26,8 +26,13 @@ Json::Value
doRipplePathFind(RPC::JsonContext& context)
{
using namespace telemetry;
[[maybe_unused]] auto span = SpanGuard::span(
auto span = SpanGuard::span(
TraceCategory::Rpc, pathfind_span::prefix::pathfind, pathfind_span::op::request);
if (auto const& src = context.params[jss::source_account]; src.isString())
span.setAttribute(pathfind_span::attr::sourceAccount, src.asString());
if (auto const& dst = context.params[jss::destination_account]; dst.isString())
span.setAttribute(pathfind_span::attr::destAccount, dst.asString());
if (context.app.config().PATH_SEARCH_MAX == 0)
return rpcError(rpcNOT_SUPPORTED);