Commit Graph

14644 Commits

Author SHA1 Message Date
Pratik Mankawde
793d2ecfce feat(telemetry): add txq expired/dropped counters for queue backpressure
The transaction queue had no metric for demand that leaves or never enters the
queue, so fee-underpayment abandonment and admission-control rejection were
invisible (distinct from jq_trans_overflow, which is the job queue).

Add two synchronous counters via MetricsRegistry:
- xrpld_txq_expired_total — incremented in TxQ::processClosedLedger() for each
  queued transaction removed because its LastLedgerSequence passed (submitters
  who under-bid the escalating fee and were never included).
- xrpld_txq_dropped_total{reason} — incremented in TxQ::apply() at the
  queue-full admission-control returns (reason="queue_full").

Both reach MetricsRegistry via the Application& parameter already passed to
these methods; calls are null-guarded so they no-op when telemetry is disabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:06:33 +01:00
Pratik Mankawde
7a509a01eb feat(telemetry): add xrpld_ledger_history_mismatch_total{reason} counter
LedgerHistory::handleMismatch() already classifies a built-vs-validated ledger
mismatch (prior ledger, close time, consensus tx set, same/different tx set),
but only bumped a single untyped beast::insight counter — the reason was
dropped. Fork diagnosis was therefore a log-grep exercise.

Add a labeled OTel counter so the mismatch reason is a queryable time series:
- MetricsRegistry: new ledgerHistoryMismatchCounter_ + incrementLedgerHistoryMismatch(reason)
- LedgerHistory: record one reason per classification branch (unknown,
  prior_ledger, close_time, consensus_txset, same_txset_diff_result,
  different_txset). Reaches MetricsRegistry via the existing app_ reference.

The existing beast::insight mismatchCounter_ is left intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:02:26 +01:00
Pratik Mankawde
d3955d3639 fix(telemetry): emit real diverged-peer count for peers_insane_count
The xrpld_peer_quality{metric="peers_insane_count"} gauge was hardcoded to 0.0
with a TODO, leaving the "Insane/Diverged Peers" panel permanently empty.

PeerImp::json() already exposes the peer's tracking state via the "track"
field (set to "diverged" when tracking_ == Tracking::Diverged). The peer-quality
callback already iterates peer->json() for latency and version, so count peers
whose "track" field equals "diverged" in the same loop — no change to the
abstract Peer interface required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:53:53 +01:00
Pratik Mankawde
d7baf262f8 fix(telemetry): remove duplicate consensus outcome/failures panels
A phase-8->phase-9 merge (a675897aaf) duplicated the "Consensus Outcome
Distribution" and "Consensus Failures Over Time" panels: both appeared twice
with byte-identical queries (verified ignoring gridPos). The pair existed once
on phase-6/7/8 and became two on phase-9 only, so the duplication originated
in phase-9's own merge history.

Remove the second (lower) copy of each and re-stack panel y-positions with no
gaps. The single retained copy keeps the original y=64 row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:51:52 +01:00
Pratik Mankawde
b286335ccf feat(telemetry): add load-factor attribution and 7-day agreement panels
Both metrics are already emitted and live in Prometheus but were not fully
visualised.

- Fee Market (xrpld-fee-market.json): "Load Factor Attribution (Stacked
  Components)" — stacks load_factor_fee_escalation / fee_queue / local / net /
  cluster so an operator can see which component drives the effective fee. The
  existing panels showed the aggregate only.
- Validator Health (xrpld-validator-health.json): "Agreement % (7d)" and
  "Agreements vs Missed (7d)" — the xrpld_validation_agreement gauge already
  observes agreement_pct_7d / agreements_7d / missed_7d, but the dashboard only
  plotted 1h and 24h windows.

Panels follow the existing template: $node filter, exported_instance in legends,
Title Case, axis labels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:44:07 +01:00
Pratik Mankawde
5c2997d95e Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
2026-06-04 15:41:20 +01:00
Pratik Mankawde
342b9f55a1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 15:40:17 +01:00
Pratik Mankawde
000ad1d1f5 feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard)
The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp)
are emitted but had no dashboard coverage. The existing RPC & Pathfinding
dashboard only plotted StatsD timers. Add span-derived rows:

- gRPC Request Rate by Method (grpc.* by method)
- gRPC Latency P95 by Method
- gRPC Error Rate by Status (by grpc_status)
- Pathfinding Compute Duration (pathfind.compute p95/p50)
- Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover)

otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics
dimensions (bounded value sets). Add a $grpc_method template variable so the
gRPC panels can be filtered by method, consistent with the dashboard filter
conventions.

Note: these spans populate only when the node serves gRPC / pathfinding
traffic; they are correct but not exercised by the current health-check
workload (they will be covered by the Phase 10 workload generator).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:40:07 +01:00
Pratik Mankawde
17ffe8b049 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 15:37:55 +01:00
Pratik Mankawde
63c6f3b8df feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards
The consensus state-machine and TxQ lifecycle spans are emitted by the code
and present in Prometheus, but no panel visualised them. Add panels keyed on
those span_names (verified live) plus the low-cardinality dimensions needed to
break them down.

Consensus Health (consensus-health.json) — new rows:
- Consensus Round Duration (full round, p95/p50, mode-filterable)
- Consensus Phase Duration (open vs establish breakdown)
- Position Update Duration (update_positions p95/p50)
- Consensus Stall Rate (consensus.check by consensus_stalled)
- Consensus Mode-Change Rate by Target Mode (mode_change by mode_new)

Transaction Overview (transaction-overview.json) — new rows:
- TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type)
- Queue Bypass Ratio (txq.apply_direct vs txq.enqueue)
- Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50)
- Queue Cleanup Rate (txq.cleanup expired entries)

otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle
breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result
(all bounded value sets, safe as Prometheus labels).

All new panels follow the existing dashboard template: $node filter,
exported_instance in every legend, Title Case, axis labels, row layout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:37:29 +01:00
Pratik Mankawde
4174aef07b fix(telemetry): align consensus_mode spanmetrics label with emitted attribute
The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code
emits the span attribute under the bare key `consensus_mode` (matching every
other dimension after the Phase 6 rename). The mismatch left the
`xrpl_consensus_mode` Prometheus label empty, so the Consensus Health
"Consensus Mode Over Time" panel and the `$consensus_mode` template variable
(which filters every panel) matched no live series.

- otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode`
- consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode`
  (the `$consensus_mode` Grafana variable name is unchanged)
- telemetry-runbook.md: refresh the stale spanmetrics label table to the bare
  names actually emitted (command/rpc_status/consensus_mode/local/
  proposal_trusted/validation_trusted), fix dotted->bare attribute names in
  span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id,
  consensus_ledger_id, consensus_round, tx_id event attr), correct the
  consensus_round_id query to int (not quoted string), and fix the
  load_type value query ("exception_rpc" -> "exceptioned RPC").

Verified against the live stack: Tempo span tags confirm bare attribute keys
(consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode
series in Prometheus is stale retained data from an older build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:29:45 +01:00
Pratik Mankawde
e6643a4389 updated tags
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 14:46:57 +01:00
Pratik Mankawde
80800ee130 use image-renderer in graphana
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 14:40:35 +01:00
Pratik Mankawde
ebc5c5ed9d fix(telemetry): set service_instance_id in [insight] so dashboards filter
beast::insight metrics exported via OTLP carried no exported_instance
label because [insight] omitted service_instance_id (only [telemetry]
set it). Every system-* dashboard filters insight metrics with
exported_instance=~"$node", and the $node template variable is sourced
from label_values(..., exported_instance) — so with the label absent,
$node was empty and all insight-backed panels showed no data.

Add service_instance_id to [insight] in both telemetry configs, matching
the [telemetry] value (xrpld-mainnet / xrpld-devnet). CollectorManager
already reads this key and passes it to OTelCollector, which sets the
service.instance.id resource attribute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:36:04 +01:00
Pratik Mankawde
61c2760296 consmetic updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 14:32:13 +01:00
Pratik Mankawde
88ac4b6aee fix(telemetry): use short unit for NodeStore and object-count panels
The phase-9 NodeStore I/O totals, write-load/read-queue, read-threads,
and object instance-count panels rendered large cumulative values with
unit "none". Switch to "short" for readable abbreviation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:27:53 +01:00
Pratik Mankawde
a5f80514a9 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 14:26:16 +01:00
Pratik Mankawde
90f7a8bd4e Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 14:26:16 +01:00
Pratik Mankawde
45ab508ed8 fix(telemetry): use short unit for large count/message panels
Count and message-volume panels (operating-mode transitions, job queue
depth, network/overlay message totals, getobject message counts) used
unit "none", rendering large values as raw unscaled numbers. Switch to
"short" so Grafana abbreviates (e.g. 1.5 Mil) for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:26:03 +01:00
Pratik Mankawde
a6cebf21b0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/system-node-health.json
2026-06-04 14:06:46 +01:00
Pratik Mankawde
6c71aa8c2a Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 14:05:25 +01:00
Pratik Mankawde
9b46a343fc fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics
The system-* dashboards queried the legacy StatsD rippled_ prefix, but the
node now emits beast::insight metrics via native OTLP under the xrpld_
prefix (config: [insight] server=otel, prefix=xrpld). All queries returned
no data.

Migration (names derived from C++ beast::insight registrations, not live
Prometheus, since a syncing node does not emit every metric yet):
- rippled_ -> xrpld_ prefix across all panel queries and template variables
  (including the $node variable query, which broke the whole dashboard filter)
- Histogram Event instruments export with unit ms, so bare _bucket becomes
  _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full
- Job-type metrics were StatsD summaries (label quantile="$quantile"); on the
  OTLP path they are histograms. Converted those queries to
  histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m]))
  and added the previously-undefined $quantile template variable
- Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket

No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or
job types the syncing node has not run) show no data until the path executes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:01:13 +01:00
Pratik Mankawde
10b4112382 fix(telemetry): use p75/p99 quantiles and add gauge panels for job/rpc latency
P100 from a histogram is degenerate — it always returns the upper bound of
the highest populated bucket (a single slow outlier pins it to the top
boundary), producing a flat line. Revert to meaningful quantiles:
- Job Queue Wait Time / Job Execution Time: p75 (typical) + p99 (tail)
- Per-Job-Type / Per-Method: p99
- Added gauge panels showing current p99 with green/yellow/red thresholds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 12:46:58 +01:00
Pratik Mankawde
859bd21ca5 only render p100.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 12:16:48 +01:00
Pratik Mankawde
15d3e3a375 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 11:28:04 +01:00
Pratik Mankawde
0fe09cda9b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 11:28:04 +01:00
Pratik Mankawde
a9cc1067d0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 11:28:04 +01:00
Pratik Mankawde
194f5b8af8 fix(telemetry): set ms unit on duration heatmap y-axes
The three duration heatmaps (transaction, consensus accept, RPC latency)
had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick
values rendered unscaled. Set unit=ms on both the yAxis options and
panel defaults so buckets display as proper millisecond values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 11:27:46 +01:00
Pratik Mankawde
37c9168065 fix(telemetry): correct invalid 'us' unit code to 'µs' on duration panels
Grafana does not recognize 'us' as a unit code, so microsecond values
rendered as raw numbers with a plain 'us' suffix (no scaling). The
correct code is 'µs'. Affects job-queue and OTel RPC latency panels
backed by *_duration_us histograms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 11:26:43 +01:00
Pratik Mankawde
373012e84d Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 10:55:36 +01:00
Pratik Mankawde
8f9fa52f93 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:55:35 +01:00
Pratik Mankawde
fb7c3bc38d Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-06-04 10:55:27 +01:00
Pratik Mankawde
8e606bbaf4 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:55:11 +01:00
Pratik Mankawde
40fba327cf Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 10:53:56 +01:00
Pratik Mankawde
811b934004 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:53:55 +01:00
Pratik Mankawde
c80038fd42 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 10:53:55 +01:00
Pratik Mankawde
7397bbcdd2 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:53:45 +01:00
Pratik Mankawde
8259026a25 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 10:47:47 +01:00
Pratik Mankawde
9947a52e79 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:47:47 +01:00
Pratik Mankawde
ee2f1b4fbf Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 10:47:47 +01:00
Pratik Mankawde
2627ea7f65 feat(telemetry): add TX Processing Latency by Type panel to dashboard
Shows p95 latency of tx.process span broken down by tx_type. Works for
both received and locally-processed transactions, unlike the tx.transactor
panel which requires the node to be synced and applying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:47:33 +01:00
Pratik Mankawde
56d33fc87f Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-03 17:25:22 +01:00
Pratik Mankawde
013252f210 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-03 17:25:22 +01:00
Pratik Mankawde
970914d2ce Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-03 17:25:22 +01:00
Pratik Mankawde
289b049b70 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-03 17:25:22 +01:00
Pratik Mankawde
4e422a0354 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 17:25:22 +01:00
Pratik Mankawde
36cae13352 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-03 17:25:22 +01:00
Pratik Mankawde
dfd67b8124 fix(telemetry): eliminate duplicate suppressed attribute on tx.receive span
The OTel C++ SDK's SetAttribute appends rather than overwrites on
in-flight spans. Setting suppressed=false as a default then overriding
to true resulted in both values appearing in the exported span.

Fix: remove the default-false set, place suppressed=false once after
the HashRouter check passes (non-suppressed path), and suppressed=true
remains only in the suppressed path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 17:23:59 +01:00
Pratik Mankawde
a675897aaf Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Resolve consensus dashboard conflict and remove duplicate
consensus_state dimension in collector config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:53:10 +01:00
Pratik Mankawde
f60c995fe1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-03 16:52:00 +01:00