rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-23 23:20:33 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	ca3a78abce	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 16:43:25 +01:00
Pratik Mankawde	0a800069bf	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 16:43:25 +01:00
Pratik Mankawde	eef11a65fa	fix(telemetry): code-review dashboard cleanups (legends + stale descriptions) From the code-review pass: - transaction-overview.json: the tx.process and tx.transactor latency-by-type panels used lowercase legends (p95/p50) without the per-node dimension. Use Title Case (P95/P50), add exported_instance to the by() clause, and include [{{exported_instance}}] in the legend, per the dashboard legend convention. - consensus-health.json: panel descriptions still referenced the old dotted attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename; update them to the bare emitted names (consensus_mode, ledger_seq). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 16:43:12 +01:00
Pratik Mankawde	342b9f55a1	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 15:40:17 +01:00
Pratik Mankawde	000ad1d1f5	feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard) The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp) are emitted but had no dashboard coverage. The existing RPC & Pathfinding dashboard only plotted StatsD timers. Add span-derived rows: - gRPC Request Rate by Method (grpc.* by method) - gRPC Latency P95 by Method - gRPC Error Rate by Status (by grpc_status) - Pathfinding Compute Duration (pathfind.compute p95/p50) - Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover) otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics dimensions (bounded value sets). Add a $grpc_method template variable so the gRPC panels can be filtered by method, consistent with the dashboard filter conventions. Note: these spans populate only when the node serves gRPC / pathfinding traffic; they are correct but not exercised by the current health-check workload (they will be covered by the Phase 10 workload generator). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:40:07 +01:00
Pratik Mankawde	17ffe8b049	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 15:37:55 +01:00
Pratik Mankawde	63c6f3b8df	feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards The consensus state-machine and TxQ lifecycle spans are emitted by the code and present in Prometheus, but no panel visualised them. Add panels keyed on those span_names (verified live) plus the low-cardinality dimensions needed to break them down. Consensus Health (consensus-health.json) — new rows: - Consensus Round Duration (full round, p95/p50, mode-filterable) - Consensus Phase Duration (open vs establish breakdown) - Position Update Duration (update_positions p95/p50) - Consensus Stall Rate (consensus.check by consensus_stalled) - Consensus Mode-Change Rate by Target Mode (mode_change by mode_new) Transaction Overview (transaction-overview.json) — new rows: - TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type) - Queue Bypass Ratio (txq.apply_direct vs txq.enqueue) - Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50) - Queue Cleanup Rate (txq.cleanup expired entries) otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result (all bounded value sets, safe as Prometheus labels). All new panels follow the existing dashboard template: $node filter, exported_instance in every legend, Title Case, axis labels, row layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:37:29 +01:00
Pratik Mankawde	4174aef07b	fix(telemetry): align consensus_mode spanmetrics label with emitted attribute The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code emits the span attribute under the bare key `consensus_mode` (matching every other dimension after the Phase 6 rename). The mismatch left the `xrpl_consensus_mode` Prometheus label empty, so the Consensus Health "Consensus Mode Over Time" panel and the `$consensus_mode` template variable (which filters every panel) matched no live series. - otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode` - consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode` (the `$consensus_mode` Grafana variable name is unchanged) - telemetry-runbook.md: refresh the stale spanmetrics label table to the bare names actually emitted (command/rpc_status/consensus_mode/local/ proposal_trusted/validation_trusted), fix dotted->bare attribute names in span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id, consensus_ledger_id, consensus_round, tx_id event attr), correct the consensus_round_id query to int (not quoted string), and fix the load_type value query ("exception_rpc" -> "exceptioned RPC"). Verified against the live stack: Tempo span tags confirm bare attribute keys (consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode series in Prometheus is stale retained data from an older build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:29:45 +01:00
Pratik Mankawde	a5f80514a9	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:26:16 +01:00
Pratik Mankawde	45ab508ed8	fix(telemetry): use short unit for large count/message panels Count and message-volume panels (operating-mode transitions, job queue depth, network/overlay message totals, getobject message counts) used unit "none", rendering large values as raw unscaled numbers. Switch to "short" so Grafana abbreviates (e.g. 1.5 Mil) for readability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:26:03 +01:00
Pratik Mankawde	6c71aa8c2a	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:05:25 +01:00
Pratik Mankawde	9b46a343fc	fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics The system-* dashboards queried the legacy StatsD rippled_ prefix, but the node now emits beast::insight metrics via native OTLP under the xrpld_ prefix (config: [insight] server=otel, prefix=xrpld). All queries returned no data. Migration (names derived from C++ beast::insight registrations, not live Prometheus, since a syncing node does not emit every metric yet): - rippled_ -> xrpld_ prefix across all panel queries and template variables (including the $node variable query, which broke the whole dashboard filter) - Histogram Event instruments export with unit ms, so bare _bucket becomes _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full - Job-type metrics were StatsD summaries (label quantile="$quantile"); on the OTLP path they are histograms. Converted those queries to histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m])) and added the previously-undefined $quantile template variable - Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or job types the syncing node has not run) show no data until the path executes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:01:13 +01:00
Pratik Mankawde	15d3e3a375	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 11:28:04 +01:00
Pratik Mankawde	0fe09cda9b	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 11:28:04 +01:00
Pratik Mankawde	194f5b8af8	fix(telemetry): set ms unit on duration heatmap y-axes The three duration heatmaps (transaction, consensus accept, RPC latency) had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick values rendered unscaled. Set unit=ms on both the yAxis options and panel defaults so buckets display as proper millisecond values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 11:27:46 +01:00
Pratik Mankawde	8f9fa52f93	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:55:35 +01:00
Pratik Mankawde	fb7c3bc38d	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-06-04 10:55:27 +01:00
Pratik Mankawde	8e606bbaf4	feat(telemetry): add tx_type/ter_result/txq_status dashboard filters Adds template variables $tx_type, $ter_result, $txq_status to the Transaction Overview dashboard. All relevant panels now respect these filters, enabling operators to drill into specific transaction types or result codes. Changes: - Panel 2 renamed to "Transaction Processing Latency by Type" (now shows p95/p50 per tx_type instead of aggregate) - Panels 1,3,4,5,7,9,12 filter by $tx_type - Panel 10 filters by $tx_type and $ter_result - Panel 11 filters by $txq_status - Removed redundant "TX Processing Latency by Type (p95)" panel Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:55:11 +01:00
Pratik Mankawde	811b934004	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:53:55 +01:00
Pratik Mankawde	c80038fd42	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 10:53:55 +01:00
Pratik Mankawde	7397bbcdd2	feat(telemetry): add tx_type/ter_result/txq_status dashboard filters Adds template variables $tx_type, $ter_result, $txq_status to the Transaction Overview dashboard. All relevant panels now respect these filters, enabling operators to drill into specific transaction types or result codes. Changes: - Panel 2 renamed to "Transaction Processing Latency by Type" (now shows p95/p50 per tx_type instead of aggregate) - Panels 1,3,4,5,7,9,12 filter by $tx_type - Panel 10 filters by $tx_type and $ter_result - Panel 11 filters by $txq_status - Removed redundant "TX Processing Latency by Type (p95)" panel Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:53:45 +01:00
Pratik Mankawde	9947a52e79	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:47:47 +01:00
Pratik Mankawde	ee2f1b4fbf	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 10:47:47 +01:00
Pratik Mankawde	2627ea7f65	feat(telemetry): add TX Processing Latency by Type panel to dashboard Shows p95 latency of tx.process span broken down by tx_type. Works for both received and locally-processed transactions, unlike the tx.transactor panel which requires the node to be synced and applying. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:47:33 +01:00
Pratik Mankawde	013252f210	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 17:25:22 +01:00
Pratik Mankawde	970914d2ce	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 17:25:22 +01:00
Pratik Mankawde	289b049b70	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-03 17:25:22 +01:00
Pratik Mankawde	4e422a0354	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 17:25:22 +01:00
Pratik Mankawde	36cae13352	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-03 17:25:22 +01:00
Pratik Mankawde	dfd67b8124	fix(telemetry): eliminate duplicate suppressed attribute on tx.receive span The OTel C++ SDK's SetAttribute appends rather than overwrites on in-flight spans. Setting suppressed=false as a default then overriding to true resulted in both values appearing in the exported span. Fix: remove the default-false set, place suppressed=false once after the HashRouter check passes (non-suppressed path), and suppressed=true remains only in the suppressed path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 17:23:59 +01:00
Pratik Mankawde	f60c995fe1	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 16:52:00 +01:00
Pratik Mankawde	fff8598a33	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 16:52:00 +01:00
Pratik Mankawde	ac1805f0a4	feat(telemetry): add spanmetrics dimensions and dashboard panels for enriched attrs Collector config: add tx_type, ter_result, txq_status, consensus_state, load_type, is_batch as spanmetrics dimensions so they appear as Prometheus labels for dashboard queries. New dashboard panels: - Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie), Transactor Duration p95 by Type - Consensus Health: Outcome Distribution (pie), Failures Over Time - RPC Performance: Resource Cost by Command, Batch vs Single Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:51:51 +01:00
Pratik Mankawde	365907ab22	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 16:40:22 +01:00
Pratik Mankawde	8b5ded4324	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-03 16:40:22 +01:00
Pratik Mankawde	39f3b86d17	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 16:40:22 +01:00
Pratik Mankawde	2ef026aef5	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 16:40:22 +01:00
Pratik Mankawde	03fffec640	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-03 16:40:22 +01:00
Pratik Mankawde	a13a858112	feat(telemetry): add tx.transactor span for per-transactor execution timing Wraps Transactor::operator() with a span that captures tx_type, ter_result, and applied. This is the universal dispatch point — every transaction flows through it, giving per-type latency breakdown. Adds libxrpl.tx > xrpl.telemetry levelization dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:40:10 +01:00
Pratik Mankawde	a4bc7bd611	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 16:32:31 +01:00
Pratik Mankawde	8adb5d03da	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-03 16:32:31 +01:00
Pratik Mankawde	66552e7858	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 16:32:31 +01:00
Pratik Mankawde	2264a8427a	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 16:32:31 +01:00
Pratik Mankawde	c5bdaafc39	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-03 16:32:31 +01:00
Pratik Mankawde	4b6c1c270f	feat(telemetry): add tx.transactor span for per-transactor execution timing Wraps Transactor::operator() with a span that captures tx_type, ter_result, and applied. This is the universal dispatch point — every transaction flows through it, giving per-type latency breakdown. Adds libxrpl.tx > xrpl.telemetry levelization dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:32:16 +01:00
Pratik Mankawde	3eeb8b3730	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 16:22:40 +01:00
Pratik Mankawde	93c27997b4	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 16:22:35 +01:00
Pratik Mankawde	ac79a5123e	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd Resolve runbook conflict: keep both phase 6 ledger/peer span tables AND new insights/sample queries section from the enrichment work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:22:20 +01:00
Pratik Mankawde	1b227a1eff	docs(telemetry): update runbook with enriched attributes and sample queries Adds comprehensive "Insights and Sample Queries" section showing operators what questions they can answer with the newly-added span attributes: - Transaction workflow analysis (filter by tx_type, fee, ter_result) - TxQ health (txq_status, ledger_changed) - RPC debugging (is_batch, request_payload_size, load_type) - PathFinding performance (dest_currency, num_source_assets) - Consensus health (consensus_state, is_bow_out, disputes_count) - Cross-subsystem correlation examples Also updates all span reference tables with the new attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:18:43 +01:00
Pratik Mankawde	b0e9e1a24d	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 16:16:53 +01:00

14564 Commits