Commit Graph

143 Commits

Author SHA1 Message Date
Pratik Mankawde
88ac4b6aee fix(telemetry): use short unit for NodeStore and object-count panels
The phase-9 NodeStore I/O totals, write-load/read-queue, read-threads,
and object instance-count panels rendered large cumulative values with
unit "none". Switch to "short" for readable abbreviation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:27:53 +01:00
Pratik Mankawde
90f7a8bd4e Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
2026-06-04 14:26:16 +01:00
Pratik Mankawde
a5f80514a9 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 14:26:16 +01:00
Pratik Mankawde
45ab508ed8 fix(telemetry): use short unit for large count/message panels
Count and message-volume panels (operating-mode transitions, job queue
depth, network/overlay message totals, getobject message counts) used
unit "none", rendering large values as raw unscaled numbers. Switch to
"short" so Grafana abbreviates (e.g. 1.5 Mil) for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:26:03 +01:00
Pratik Mankawde
a6cebf21b0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/system-node-health.json
2026-06-04 14:06:46 +01:00
Pratik Mankawde
6c71aa8c2a Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 14:05:25 +01:00
Pratik Mankawde
9b46a343fc fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics
The system-* dashboards queried the legacy StatsD rippled_ prefix, but the
node now emits beast::insight metrics via native OTLP under the xrpld_
prefix (config: [insight] server=otel, prefix=xrpld). All queries returned
no data.

Migration (names derived from C++ beast::insight registrations, not live
Prometheus, since a syncing node does not emit every metric yet):
- rippled_ -> xrpld_ prefix across all panel queries and template variables
  (including the $node variable query, which broke the whole dashboard filter)
- Histogram Event instruments export with unit ms, so bare _bucket becomes
  _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full
- Job-type metrics were StatsD summaries (label quantile="$quantile"); on the
  OTLP path they are histograms. Converted those queries to
  histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m]))
  and added the previously-undefined $quantile template variable
- Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket

No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or
job types the syncing node has not run) show no data until the path executes.
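The renaming rules listed above can be sketched as a small query rewriter. This is an illustration only, not the script used for the commit: the function names and the example job name are hypothetical, and the histogram list is copied from the bullet above.

```python
import re

# Histogram metrics that gain a unit suffix on the OTLP path (taken from the
# commit message; treating this as the complete list is an assumption).
MS_HISTOGRAMS = ("ios_latency", "rpc_time", "rpc_size",
                 "pathfind_fast", "pathfind_full")


def migrate_query(expr: str) -> str:
    """Rewrite a legacy rippled_ PromQL expression to the xrpld_ naming."""
    # Rule 1: swap the metric prefix everywhere it starts a metric name.
    expr = re.sub(r"\brippled_", "xrpld_", expr)
    # Rule 2: bare _bucket becomes _milliseconds_bucket for known histograms.
    for name in MS_HISTOGRAMS:
        expr = re.sub(rf"\bxrpld_{name}_bucket\b",
                      f"xrpld_{name}_milliseconds_bucket", expr)
    return expr


def job_quantile_query(job: str) -> str:
    """Build the histogram form that replaces a StatsD summary query."""
    return (f"histogram_quantile($quantile, "
            f"rate(xrpld_{job}_milliseconds_bucket[5m]))")


if __name__ == "__main__":
    print(migrate_query("rate(rippled_rpc_time_bucket[5m])"))
    print(job_quantile_query("ledgerData"))  # "ledgerData" is a made-up job name
```

Metrics outside the histogram list only receive the prefix swap, which matches the first bullet of the migration.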

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:01:13 +01:00
Pratik Mankawde
10b4112382 fix(telemetry): use p75/p99 quantiles and add gauge panels for job/rpc latency
P100 from a histogram is degenerate — it always returns the upper bound of
the highest populated bucket (a single slow outlier pins it to the top
boundary), producing a flat line. Revert to meaningful quantiles:
- Job Queue Wait Time / Job Execution Time: p75 (typical) + p99 (tail)
- Per-Job-Type / Per-Method: p99
- Added gauge panels showing current p99 with green/yellow/red thresholds
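Why p100 pins to a bucket boundary can be seen in a simplified bucket-interpolation sketch. This is not Prometheus source code: it ignores the +Inf bucket and other edge cases, and the bucket bounds and counts are made-up illustration values.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound.
    Uses linear interpolation inside the bucket that contains the rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        # First non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > prev_count:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Hypothetical latency histogram in ms: 1000 observations, one of which is a
# slow outlier that landed somewhere in the (500, 1000] bucket.
buckets = [(10, 900), (50, 990), (100, 999), (500, 999), (1000, 1000)]

if __name__ == "__main__":
    # p100 returns the upper bound of the highest populated bucket (1000.0),
    # regardless of where in that bucket the outlier actually fell.
    print(histogram_quantile(1.0, buckets))
    print(histogram_quantile(0.99, buckets))
    print(histogram_quantile(0.75, buckets))
```

With q = 1 the interpolation fraction is always 1, so the result carries no information beyond the bucket layout, which is the flat line described above.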

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 12:46:58 +01:00
Pratik Mankawde
859bd21ca5 only render p100.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 12:16:48 +01:00
Pratik Mankawde
15d3e3a375 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 11:28:04 +01:00
Pratik Mankawde
0fe09cda9b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
2026-06-04 11:28:04 +01:00
Pratik Mankawde
a9cc1067d0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
2026-06-04 11:28:04 +01:00
Pratik Mankawde
194f5b8af8 fix(telemetry): set ms unit on duration heatmap y-axes
The three duration heatmaps (transaction, consensus accept, RPC latency)
had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick
values rendered unscaled. Set unit=ms on both the yAxis options and
panel defaults so buckets display as proper millisecond values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 11:27:46 +01:00
Pratik Mankawde
37c9168065 fix(telemetry): correct invalid 'us' unit code to 'µs' on duration panels
Grafana does not recognize 'us' as a unit code, so microsecond values
rendered as raw numbers with a plain 'us' suffix (no scaling). The
correct code is 'µs'. Affects job-queue and OTel RPC latency panels
backed by *_duration_us histograms.
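A unit-code fix of this kind can be sketched as a recursive walk over the dashboard JSON. This is a sketch under assumptions, not the change actually applied: the function name is hypothetical, and it assumes every key named "unit" holds a Grafana unit code.

```python
def fix_unit_codes(node, old="us", new="µs"):
    """Replace every "unit": old with "unit": new inside a parsed dashboard.

    Mutates the structure in place and returns the number of replacements.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "unit" and value == old:
                node[key] = new  # value update only, safe while iterating
                changed += 1
            else:
                changed += fix_unit_codes(value, old, new)
    elif isinstance(node, list):
        for item in node:
            changed += fix_unit_codes(item, old, new)
    return changed


if __name__ == "__main__":
    # Minimal stand-in for a dashboard file; real dashboards carry more fields.
    dashboard = {
        "panels": [
            {"title": "RPC latency",
             "fieldConfig": {"defaults": {"unit": "us"}}},
            {"title": "Queue depth",
             "fieldConfig": {"defaults": {"unit": "short"}}},
        ]
    }
    print(fix_unit_codes(dashboard))
    print(dashboard["panels"][0]["fieldConfig"]["defaults"]["unit"])
```

The same walker could cover the "none" to "short" changes by passing different `old` and `new` values, though those commits were limited to specific panels rather than applied to every panel.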

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 11:26:43 +01:00
Pratik Mankawde
373012e84d Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
2026-06-04 10:55:36 +01:00
Pratik Mankawde
8f9fa52f93 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 10:55:35 +01:00
Pratik Mankawde
fb7c3bc38d Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-06-04 10:55:27 +01:00
Pratik Mankawde
8e606bbaf4 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:55:11 +01:00
Pratik Mankawde
40fba327cf Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
2026-06-04 10:53:56 +01:00
Pratik Mankawde
811b934004 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 10:53:55 +01:00
Pratik Mankawde
c80038fd42 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
2026-06-04 10:53:55 +01:00
Pratik Mankawde
7397bbcdd2 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:53:45 +01:00
Pratik Mankawde
8259026a25 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
2026-06-04 10:47:47 +01:00
Pratik Mankawde
9947a52e79 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-04 10:47:47 +01:00
Pratik Mankawde
ee2f1b4fbf Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
2026-06-04 10:47:47 +01:00
Pratik Mankawde
2627ea7f65 feat(telemetry): add TX Processing Latency by Type panel to dashboard
Shows p95 latency of tx.process span broken down by tx_type. Works for
both received and locally-processed transactions, unlike the tx.transactor
panel which requires the node to be synced and applying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:47:33 +01:00
Pratik Mankawde
a675897aaf Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Resolve consensus dashboard conflict and remove duplicate
consensus_state dimension in collector config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:53:10 +01:00
Pratik Mankawde
f60c995fe1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
2026-06-03 16:52:00 +01:00
Pratik Mankawde
fff8598a33 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
2026-06-03 16:52:00 +01:00
Pratik Mankawde
ac1805f0a4 feat(telemetry): add spanmetrics dimensions and dashboard panels for enriched attrs
Collector config: add tx_type, ter_result, txq_status, consensus_state,
load_type, is_batch as spanmetrics dimensions so they appear as
Prometheus labels for dashboard queries.

New dashboard panels:
- Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie),
  Transactor Duration p95 by Type
- Consensus Health: Outcome Distribution (pie), Failures Over Time
- RPC Performance: Resource Cost by Command, Batch vs Single

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:51:51 +01:00
Pratik Mankawde
11717a5431 build fixed
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 18:13:10 +01:00
Pratik Mankawde
615d339f84 fix(docs): apply rename scripts — prefix=rippled to prefix=xrpld
The check-rename CI job requires all rename scripts to have been run.
The telemetry config files had 'prefix=rippled' which should be 'xrpld'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-01 17:03:27 +01:00
Pratik Mankawde
4d6ddb5f1f Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 14:56:09 +01:00
Pratik Mankawde
ba7e1f98e4 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:24:43 +01:00
Pratik Mankawde
d7579b2861 formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:21:00 +01:00
Pratik Mankawde
088848e7ab formatting updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:20:08 +01:00
Pratik Mankawde
e7dea147cd Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:18:36 +01:00
Pratik Mankawde
8d730b8b9a Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
2f96c6547c Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:51:31 +01:00
Pratik Mankawde
c187a62353 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:47:15 +01:00
Pratik Mankawde
c848e51e13 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:44:07 +01:00
Pratik Mankawde
3a1f22583f Merge branch 'pratik/otel-phase1a-plan-docs' into pratik/otel-phase1b-telemetry-infra
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 15:34:22 +01:00
Pratik Mankawde
7ac5343119 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:09:41 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Pratik Mankawde
ce04dac32e consensus total per round time panel added
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 14:54:36 +01:00
Pratik Mankawde
0330d037ef connection to mainnet added
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 14:53:29 +01:00
Ayaz Salikhov
23d0812827 style: Use shfmt instead of bashate (#7326)
2026-05-26 18:28:23 +00:00
Ayaz Salikhov
49cb3f45a4 ci: Add clang to nix images (#7308)
Co-authored-by: semgrep-companion-app[bot] <218312740+semgrep-companion-app[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-26 15:45:33 +00:00