The phase-9 NodeStore I/O totals, write-load/read-queue, read-threads,
and object instance-count panels rendered large cumulative values with
unit "none". Switch to "short" for readable abbreviation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Count and message-volume panels (operating-mode transitions, job queue
depth, network/overlay message totals, getobject message counts) used
unit "none", rendering large values as raw unscaled numbers. Switch to
"short" so Grafana abbreviates (e.g. 1.5 Mil) for readability.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The system-* dashboards queried the legacy StatsD rippled_ prefix, but the
node now emits beast::insight metrics via native OTLP under the xrpld_
prefix (config: [insight] server=otel, prefix=xrpld). All queries returned
no data.
Migration (names derived from C++ beast::insight registrations, not live
Prometheus, since a syncing node does not emit every metric yet):
- rippled_ -> xrpld_ prefix across all panel queries and template variables
(including the $node variable query, which broke the whole dashboard filter)
- Histogram Event instruments export with unit ms, so bare _bucket becomes
_milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full
- Job-type metrics were StatsD summaries (label quantile="$quantile"); on the
OTLP path they are histograms. Converted those queries to
histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m]))
and added the previously-undefined $quantile template variable
- Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket
No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or
job types the syncing node has not run) show no data until the path executes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P100 from a histogram is degenerate — it always returns the upper bound of
the highest populated bucket (a single slow outlier pins it to the top
boundary), producing a flat line. Revert to meaningful quantiles:
- Job Queue Wait Time / Job Execution Time: p75 (typical) + p99 (tail)
- Per-Job-Type / Per-Method: p99
- Added gauge panels showing current p99 with green/yellow/red thresholds
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The three duration heatmaps (transaction, consensus accept, RPC latency)
had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick
values rendered unscaled. Set unit=ms on both the yAxis options and
panel defaults so buckets display as proper millisecond values.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Grafana does not recognize 'us' as a unit code, so microsecond values
rendered as raw numbers with a plain 'us' suffix (no scaling). The
correct code is 'µs'. Affects job-queue and OTel RPC latency panels
backed by *_duration_us histograms.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.
Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.
Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shows p95 latency of tx.process span broken down by tx_type. Works for
both received and locally-processed transactions, unlike the tx.transactor
panel which requires the node to be synced and applying.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve consensus dashboard conflict and remove duplicate
consensus_state dimension in collector config.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collector config: add tx_type, ter_result, txq_status, consensus_state,
load_type, is_batch as spanmetrics dimensions so they appear as
Prometheus labels for dashboard queries.
New dashboard panels:
- Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie),
Transactor Duration p95 by Type
- Consensus Health: Outcome Distribution (pie), Failures Over Time
- RPC Performance: Resource Cost by Command, Batch vs Single
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The check-rename CI job requires all rename scripts to have been run.
The telemetry config files had 'prefix=rippled' which should be 'xrpld'.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
read the same data via server_info / server_state RPC; OTel SDK 1.18.0
cannot refresh resource attrs at runtime so resource-level emission was
not viable either.
- Namespace all pathfind span attributes under pathfind_* (underscore form
per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
xrpl.pathfind.ledger_index -> pathfind_ledger_index.
- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
in doPathFind / doRipplePathFind handlers (only when present + string).
- Collapse per-asset pathfind.discover / pathfind.rank spans into one
pathfind.discover hoisted around the per-source-asset loop in
PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
per-asset breakdown traded for bounded storage and cardinality. Trade-off
documented inline.
- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
the loop (paths actually returned) instead of the maxPaths input cap.
- PathRequestManager::updateAll: move span creation after the locked
requests_ snapshot, early-return when no active subscriptions exist
(avoids empty span on every ledger close), set pathfind_num_requests
= requests.size().
- Update Phase2_taskList.md and 02-design-decisions.md to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>