Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.
- collector: add the `stage` dimension to the spanmetrics connector so
the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
with rate, p95 latency, and failure-rate panels grouped by stage, plus
a `stage` template variable. Panels follow the existing config (node
filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
status, because a failing ter code completes the span normally — only
thrown exceptions set an error status. This matches the existing
"Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
collection reference and runbook, including the sampling caveat that
span-derived metrics inherit tracer head-sampling and undercount at
sampling_ratio < 1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The tx.transactor span covered only the apply stage; preflight and
preclaim had no telemetry, so a transaction that hard-failed those
stages produced no apply-pipeline span and per-stage latency/failure
was invisible.
Add tx.preflight and tx.preclaim spans in applySteps.cpp via a
makeStageSpan() helper using SpanGuard::hashSpan, so all three stages
share a deterministic trace_id derived from txID[0:16] even though they
run sequentially and often cross-thread. Each span carries stage,
tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before
the public wrappers map them. The type lookup is guarded behind the
span-active check so it costs nothing when tracing is off.
Add a stage="apply" attribute to the tx.transactor span and move its
three hardcoded attribute strings to a new library-safe header
include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side
TxSpanNames.h strings so the collector spanmetrics connector aggregates
both span sets under one dimension set.
A constants-contract test pins the span-name, attribute-key, and
stage-value strings; span content stays covered by the docker
integration test, as the rest of the telemetry suite is.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reduced from 30 to 7 filters: service.instance.id, name, status, command,
tx_hash, tx_type, ledger_hash. Full attribute inventory is in
OpenTelemetryPlan/09-data-collection-reference.md §4; TraceQL autocomplete
covers the rest.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
From the code-review pass:
- transaction-overview.json: the tx.process and tx.transactor latency-by-type
panels used lowercase legends (p95/p50) without the per-node dimension. Use
Title Case (P95/P50), add exported_instance to the by() clause, and include
[{{exported_instance}}] in the legend, per the dashboard legend convention.
- consensus-health.json: panel descriptions still referenced the old dotted
attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename;
update them to the bare emitted names (consensus_mode, ledger_seq).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
handleMismatch() recorded the "consensus_txset" reason but then fell through
to the transaction-level comparison, which also recorded a reason
("same_txset_diff_result" / "different_txset"). A single mismatch with
disagreeing consensus tx-set hashes therefore incremented
xrpld_ledger_history_mismatch_total twice across two reason labels, so the
sum over reason exceeded the real mismatch count.
The consensus tx-set hash disagreement is the root cause; return after
recording it so each mismatch contributes exactly one reason.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The A1 fix (xrpl_consensus_mode -> consensus_mode) was applied on phase-6, but
the phase-8->phase-9 merge conflict resolution for consensus-health.json took
phase-9's pre-fix panel base, silently reintroducing all 11 stale
xrpl_consensus_mode label references (the spanmetrics label that is never
populated — see the original A1 commit).
Re-apply the label fix on phase-9: xrpl_consensus_mode -> consensus_mode in
every panel expr, legendFormat, and the $consensus_mode template variable's
label_values() query. The Grafana variable name $consensus_mode is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Visualise the metrics added in this series:
- consensus-health: "Ledger History Mismatch Rate by Reason"
(xrpld_ledger_history_mismatch_total by reason — fork diagnostics)
- fee-market: "Queue Abandonment Rate (Expired)" and "Queue Admission
Rejections (Dropped)" (xrpld_txq_expired_total / dropped_total)
- peer-network: "Reduce-Relay Peer Selection" and "Reduce-Relay Missing-Tx
Frequency" (xrpld_reduce_relay_metrics)
- system-node-health: "Ledger Acquire Duration" and "Ledger Acquire Rate by
Outcome" (ledger.acquire span)
otel-collector-config.yaml: add outcome and acquire_reason spanmetrics
dimensions so the ledger.acquire outcome breakdown populates.
All panels follow the existing template: $node filter, exported_instance in
legends, Title Case, axis labels.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the new synchronous counters (ledger_history_mismatch_total{reason},
txq_expired_total, txq_dropped_total{reason}) and the reduce-relay observable
gauge to the ASCII ownership diagram in the MetricsRegistry header so the
documented instrument inventory matches the code.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The transaction reduce-relay subsystem (selected vs suppressed peers,
feature-disabled peers, missing-tx frequency) was computed in OverlayImpl's
TxMetrics but only surfaced via the get_counts JSON RPC — invisible to
Prometheus/Grafana, despite being the central efficiency KPI for the feature.
Add an observable gauge xrpld_reduce_relay_metrics{metric} that reads
Overlay::txMetrics() and parses its rolling-average fields:
- selected_peers (txr_selected_cnt)
- suppressed_peers (txr_suppressed_cnt)
- not_enabled_peers (txr_not_enabled_cnt)
- missing_tx_freq (txr_missing_tx_freq)
The JSON values are decimal strings (std::to_string), parsed via std::stoll —
the same JSON-reading pattern as registerNodeStoreGauge. No new Overlay
accessor or core-interface change required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
InboundLedger drives ledger back-fill and fork recovery with timeout/retry
logic (kLedgerTimeoutRetriesMax = 6), but emitted only a global ledger_fetches
counter — sync/recovery cost was a telemetry blind spot.
Add a ledger.acquire span that wraps the acquisition lifecycle:
- Started in InboundLedger::init() with ledger_seq and acquire_reason
(history / consensus / generic, mirroring InboundLedger::Reason).
- Finalized in InboundLedger::done() with outcome (complete / failed),
timeouts, and peer_count, then reset so the span duration is exported.
Held as a std::optional<SpanGuard> member (same pattern as RCLConsensus
roundSpan_). New op/attr/val constants added to LedgerSpanNames.h. Compiles to
a no-op when telemetry is disabled via the SpanGuard fallback.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The transaction queue had no metric for demand that leaves or never enters the
queue, so fee-underpayment abandonment and admission-control rejection were
invisible (distinct from jq_trans_overflow, which is the job queue).
Add two synchronous counters via MetricsRegistry:
- xrpld_txq_expired_total — incremented in TxQ::processClosedLedger() for each
queued transaction removed because its LastLedgerSequence passed (submitters
who under-bid the escalating fee and were never included).
- xrpld_txq_dropped_total{reason} — incremented in TxQ::apply() at the
queue-full admission-control returns (reason="queue_full").
Both reach MetricsRegistry via the Application& parameter already passed to
these methods; calls are null-guarded so they no-op when telemetry is disabled.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LedgerHistory::handleMismatch() already classifies a built-vs-validated ledger
mismatch (prior ledger, close time, consensus tx set, same/different tx set),
but only bumped a single untyped beast::insight counter — the reason was
dropped. Fork diagnosis was therefore a log-grep exercise.
Add a labeled OTel counter so the mismatch reason is a queryable time series:
- MetricsRegistry: new ledgerHistoryMismatchCounter_ + incrementLedgerHistoryMismatch(reason)
- LedgerHistory: record one reason per classification branch (unknown,
prior_ledger, close_time, consensus_txset, same_txset_diff_result,
different_txset). Reaches MetricsRegistry via the existing app_ reference.
The existing beast::insight mismatchCounter_ is left intact.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The xrpld_peer_quality{metric="peers_insane_count"} gauge was hardcoded to 0.0
with a TODO, leaving the "Insane/Diverged Peers" panel permanently empty.
PeerImp::json() already exposes the peer's tracking state via the "track"
field (set to "diverged" when tracking_ == Tracking::Diverged). The peer-quality
callback already iterates peer->json() for latency and version, so count peers
whose "track" field equals "diverged" in the same loop — no change to the
abstract Peer interface required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A phase-8->phase-9 merge (a675897aaf) duplicated the "Consensus Outcome
Distribution" and "Consensus Failures Over Time" panels: both appeared twice
with byte-identical queries (verified ignoring gridPos). The pair existed once
on phase-6/7/8 and became two on phase-9 only, so the duplication originated
in phase-9's own merge history.
Remove the second (lower) copy of each and re-stack panel y-positions with no
gaps. The single retained copy keeps the original y=64 row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both metrics are already emitted and live in Prometheus but were not fully
visualised.
- Fee Market (xrpld-fee-market.json): "Load Factor Attribution (Stacked
Components)" — stacks load_factor_fee_escalation / fee_queue / local / net /
cluster so an operator can see which component drives the effective fee. The
existing panels showed the aggregate only.
- Validator Health (xrpld-validator-health.json): "Agreement % (7d)" and
"Agreements vs Missed (7d)" — the xrpld_validation_agreement gauge already
observes agreement_pct_7d / agreements_7d / missed_7d, but the dashboard only
plotted 1h and 24h windows.
Panels follow the existing template: $node filter, exported_instance in legends,
Title Case, axis labels.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp)
are emitted but had no dashboard coverage. The existing RPC & Pathfinding
dashboard only plotted StatsD timers. Add span-derived rows:
- gRPC Request Rate by Method (grpc.* by method)
- gRPC Latency P95 by Method
- gRPC Error Rate by Status (by grpc_status)
- Pathfinding Compute Duration (pathfind.compute p95/p50)
- Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover)
otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics
dimensions (bounded value sets). Add a $grpc_method template variable so the
gRPC panels can be filtered by method, consistent with the dashboard filter
conventions.
Note: these spans populate only when the node serves gRPC / pathfinding
traffic; they are correct but not exercised by the current health-check
workload (they will be covered by the Phase 10 workload generator).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The consensus state-machine and TxQ lifecycle spans are emitted by the code
and present in Prometheus, but no panel visualised them. Add panels keyed on
those span_names (verified live) plus the low-cardinality dimensions needed to
break them down.
Consensus Health (consensus-health.json) — new rows:
- Consensus Round Duration (full round, p95/p50, mode-filterable)
- Consensus Phase Duration (open vs establish breakdown)
- Position Update Duration (update_positions p95/p50)
- Consensus Stall Rate (consensus.check by consensus_stalled)
- Consensus Mode-Change Rate by Target Mode (mode_change by mode_new)
Transaction Overview (transaction-overview.json) — new rows:
- TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type)
- Queue Bypass Ratio (txq.apply_direct vs txq.enqueue)
- Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50)
- Queue Cleanup Rate (txq.cleanup expired entries)
otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle
breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result
(all bounded value sets, safe as Prometheus labels).
All new panels follow the existing dashboard template: $node filter,
exported_instance in every legend, Title Case, axis labels, row layout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code
emits the span attribute under the bare key `consensus_mode` (matching every
other dimension after the Phase 6 rename). The mismatch left the
`xrpl_consensus_mode` Prometheus label empty, so the Consensus Health
"Consensus Mode Over Time" panel and the `$consensus_mode` template variable
(which filters every panel) matched no live series.
- otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode`
- consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode`
(the `$consensus_mode` Grafana variable name is unchanged)
- telemetry-runbook.md: refresh the stale spanmetrics label table to the bare
names actually emitted (command/rpc_status/consensus_mode/local/
proposal_trusted/validation_trusted), fix dotted->bare attribute names in
span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id,
consensus_ledger_id, consensus_round, tx_id event attr), correct the
consensus_round_id query to int (not quoted string), and fix the
load_type value query ("exception_rpc" -> "exceptioned RPC").
Verified against the live stack: Tempo span tags confirm bare attribute keys
(consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode
series in Prometheus is stale retained data from an older build.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
beast::insight metrics exported via OTLP carried no exported_instance
label because [insight] omitted service_instance_id (only [telemetry]
set it). Every system-* dashboard filters insight metrics with
exported_instance=~"$node", and the $node template variable is sourced
from label_values(..., exported_instance) — so with the label absent,
$node was empty and all insight-backed panels showed no data.
Add service_instance_id to [insight] in both telemetry configs, matching
the [telemetry] value (xrpld-mainnet / xrpld-devnet). CollectorManager
already reads this key and passes it to OTelCollector, which sets the
service.instance.id resource attribute.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The phase-9 NodeStore I/O totals, write-load/read-queue, read-threads,
and object instance-count panels rendered large cumulative values with
unit "none". Switch to "short" for readable abbreviation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Count and message-volume panels (operating-mode transitions, job queue
depth, network/overlay message totals, getobject message counts) used
unit "none", rendering large values as raw unscaled numbers. Switch to
"short" so Grafana abbreviates (e.g. 1.5 Mil) for readability.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>