rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-30 18:40:28 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	1fa39fdef6	fix(telemetry): move job-queue gauge to top and add stable panel ids The Current Job Latency gauge sat at the bottom of the Job Queue Analysis dashboard; per the dashboard guideline gauges belong at the top. Move it to the first row and reflow the remaining panels below it. Also assign explicit sequential panel ids so deep links stay stable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:25:40 +01:00
Pratik Mankawde	18121a8cf4	fix(telemetry): widen only timeseries with right-side table legends Correct the width rule from the previous layout commit. Full width (w=24) is now applied ONLY to timeseries panels whose legend is a right-side table, since those legends need the horizontal room. Panels with default/bottom legends, pie charts, and the heatmap return to half width. This narrows "Transaction Receive vs Suppressed" and "TxQ Enqueue Rate by Transaction Type", which were wrongly widened. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:24:06 +01:00
Pratik Mankawde	93c31573c5	refactor(telemetry): stable panel ids and topic-grouped layout Make transaction-overview deep-links stable and improve readability: - Assign explicit sequential panel ids (1..20) so viewPanel=panel-N URLs stay pinned to the same chart across edits. Previously ids were unset and Grafana auto-assigned them by array position, so any reorder silently repointed bookmarks. - Move the single-value stat panel (Transaction Apply Failed Rate) to the top row. - Lay out in three topic sections (Processing, Apply Pipeline, Queue). Within each, timeseries with a breakdown dimension (tx_type, stage, ter_result, suppressed) take full width so their right-side table legends are readable; single-series panels, pie charts, and the heatmap stay half-width and pair up. All six template variables already default to All (includeAll + multi); no change needed there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:02:47 +01:00
Pratik Mankawde	ceea9e49dd	fix(telemetry): standardize transaction-overview legends and cap tooltips Apply the dashboard legend convention across all panels now that the P50 series have been removed (P95-only): - Drop the redundant "P95 " / "P50 " prefix; the panel title already states the percentile. - Put every filter/dimension value inside [] comma-separated, ending with exported_instance, e.g. "AMMDeposit [Preclaim, xrpld-mainnet]". - Add exported_instance to the by() clause and legend of the three panels that filtered on $node but omitted it (Transaction Rate by Type, Transaction Results by Type, TxQ Accept Status), so per-node series are produced. - Title-case the stage value for display via label_replace in the four apply-pipeline panels; the span attribute stays lowercase (preflight/preclaim/apply) since legendFormat cannot change case. - Cap tooltip maxHeight at 500 on every panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 15:06:52 +01:00
Pratik Mankawde	db4d70bbc2	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-05 13:40:19 +01:00
Pratik Mankawde	b8dd848899	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 13:40:18 +01:00
Pratik Mankawde	b321792a14	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-05 13:40:18 +01:00
Pratik Mankawde	72642b5dc6	feat(telemetry): add tx apply latency panel by type and stage The existing apply-pipeline panels show latency by stage (all types combined) or by type (single span). Neither answers "for a given transaction type, which stage dominates its latency". Add a p95 panel grouped by both tx_type and stage, filterable via the $tx_type and $stage variables. Both dimensions already exist in spanmetrics, so no collector change is needed. Reflow the section so the full-width failure panel sits below the new full-width panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:39:59 +01:00
Pratik Mankawde	f37a4a1022	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill # Conflicts: # src/xrpld/app/misc/detail/TxQ.cpp	2026-06-05 12:49:38 +01:00
Pratik Mankawde	8f3974c094	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 12:48:40 +01:00
Pratik Mankawde	283fbaa54f	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 12:48:31 +01:00
Pratik Mankawde	3167a49f41	feat(telemetry): derive per-stage tx metrics from apply-pipeline spans Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim, tx.transactor) added on phase-3 through the observability stack so the spanmetrics connector produces per-stage RED metrics without any native instruments. - collector: add the `stage` dimension to the spanmetrics connector so the three stages split into separate metric series (3 bounded values). - dashboard: add a "Tx Apply Pipeline" section to transaction-overview with rate, p95 latency, and failure-rate panels grouped by stage, plus a `stage` template variable. Panels follow the existing config (node filter, exported_instance legends, Title Case, axis labels). - The failure panel filters ter_result != tesSUCCESS rather than span status, because a failing ter code completes the span normally — only thrown exceptions set an error status. This matches the existing "Transaction Results by Type" panel convention. - docs: document the spans, attributes, and stage dimension in the data collection reference and runbook, including the sampling caveat that span-derived metrics inherit tracer head-sampling and undercount at sampling_ratio < 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:42:53 +01:00
Pratik Mankawde	d7e847a53b	removed p50 renders from all dashboards Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 18:11:23 +01:00
Pratik Mankawde	d6b314e8d5	fix(telemetry): trim Tempo search filters to 7 cross-cutting entry points Reduced from 30 to 7 filters: service.instance.id, name, status, command, tx_hash, tx_type, ledger_hash. Full attribute inventory is in OpenTelemetryPlan/09-data-collection-reference.md §4; TraceQL autocomplete covers the rest. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-06-04 17:43:26 +01:00
Pratik Mankawde	0a800069bf	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 16:43:25 +01:00
Pratik Mankawde	938a4d17ce	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-04 16:43:25 +01:00
Pratik Mankawde	ca3a78abce	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 16:43:25 +01:00
Pratik Mankawde	eef11a65fa	fix(telemetry): code-review dashboard cleanups (legends + stale descriptions) From the code-review pass: - transaction-overview.json: the tx.process and tx.transactor latency-by-type panels used lowercase legends (p95/p50) without the per-node dimension. Use Title Case (P95/P50), add exported_instance to the by() clause, and include [{{exported_instance}}] in the legend, per the dashboard legend convention. - consensus-health.json: panel descriptions still referenced the old dotted attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename; update them to the bare emitted names (consensus_mode, ledger_seq). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 16:43:12 +01:00
Pratik Mankawde	c20d10fd36	fix(telemetry): restore consensus_mode label fix lost in phase-8->9 merge The A1 fix (xrpl_consensus_mode -> consensus_mode) was applied on phase-6, but the phase-8->phase-9 merge conflict resolution for consensus-health.json took phase-9's pre-fix panel base, silently reintroducing all 11 stale xrpl_consensus_mode label references (the spanmetrics label that is never populated — see the original A1 commit). Re-apply the label fix on phase-9: xrpl_consensus_mode -> consensus_mode in every panel expr, legendFormat, and the $consensus_mode template variable's label_values() query. The Grafana variable name $consensus_mode is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 16:20:42 +01:00
Pratik Mankawde	7d8e908879	feat(telemetry): add dashboard panels for new T3 metrics Visualise the metrics added in this series: - consensus-health: "Ledger History Mismatch Rate by Reason" (xrpld_ledger_history_mismatch_total by reason — fork diagnostics) - fee-market: "Queue Abandonment Rate (Expired)" and "Queue Admission Rejections (Dropped)" (xrpld_txq_expired_total / dropped_total) - peer-network: "Reduce-Relay Peer Selection" and "Reduce-Relay Missing-Tx Frequency" (xrpld_reduce_relay_metrics) - system-node-health: "Ledger Acquire Duration" and "Ledger Acquire Rate by Outcome" (ledger.acquire span) otel-collector-config.yaml: add outcome and acquire_reason spanmetrics dimensions so the ledger.acquire outcome breakdown populates. All panels follow the existing template: $node filter, exported_instance in legends, Title Case, axis labels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 16:16:55 +01:00
Pratik Mankawde	d7baf262f8	fix(telemetry): remove duplicate consensus outcome/failures panels A phase-8->phase-9 merge (`a675897aaf`) duplicated the "Consensus Outcome Distribution" and "Consensus Failures Over Time" panels: both appeared twice with byte-identical queries (verified ignoring gridPos). The pair existed once on phase-6/7/8 and became two on phase-9 only, so the duplication originated in phase-9's own merge history. Remove the second (lower) copy of each and re-stack panel y-positions with no gaps. The single retained copy keeps the original y=64 row. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:51:52 +01:00
Pratik Mankawde	b286335ccf	feat(telemetry): add load-factor attribution and 7-day agreement panels Both metrics are already emitted and live in Prometheus but were not fully visualised. - Fee Market (xrpld-fee-market.json): "Load Factor Attribution (Stacked Components)" — stacks load_factor_fee_escalation / fee_queue / local / net / cluster so an operator can see which component drives the effective fee. The existing panels showed the aggregate only. - Validator Health (xrpld-validator-health.json): "Agreement % (7d)" and "Agreements vs Missed (7d)" — the xrpld_validation_agreement gauge already observes agreement_pct_7d / agreements_7d / missed_7d, but the dashboard only plotted 1h and 24h windows. Panels follow the existing template: $node filter, exported_instance in legends, Title Case, axis labels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:44:07 +01:00
Pratik Mankawde	5c2997d95e	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill # Conflicts: # docker/telemetry/grafana/dashboards/consensus-health.json	2026-06-04 15:41:20 +01:00
Pratik Mankawde	342b9f55a1	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 15:40:17 +01:00
Pratik Mankawde	000ad1d1f5	feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard) The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp) are emitted but had no dashboard coverage. The existing RPC & Pathfinding dashboard only plotted StatsD timers. Add span-derived rows: - gRPC Request Rate by Method (grpc.* by method) - gRPC Latency P95 by Method - gRPC Error Rate by Status (by grpc_status) - Pathfinding Compute Duration (pathfind.compute p95/p50) - Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover) otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics dimensions (bounded value sets). Add a $grpc_method template variable so the gRPC panels can be filtered by method, consistent with the dashboard filter conventions. Note: these spans populate only when the node serves gRPC / pathfinding traffic; they are correct but not exercised by the current health-check workload (they will be covered by the Phase 10 workload generator). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:40:07 +01:00
Pratik Mankawde	17ffe8b049	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 15:37:55 +01:00
Pratik Mankawde	63c6f3b8df	feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards The consensus state-machine and TxQ lifecycle spans are emitted by the code and present in Prometheus, but no panel visualised them. Add panels keyed on those span_names (verified live) plus the low-cardinality dimensions needed to break them down. Consensus Health (consensus-health.json) — new rows: - Consensus Round Duration (full round, p95/p50, mode-filterable) - Consensus Phase Duration (open vs establish breakdown) - Position Update Duration (update_positions p95/p50) - Consensus Stall Rate (consensus.check by consensus_stalled) - Consensus Mode-Change Rate by Target Mode (mode_change by mode_new) Transaction Overview (transaction-overview.json) — new rows: - TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type) - Queue Bypass Ratio (txq.apply_direct vs txq.enqueue) - Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50) - Queue Cleanup Rate (txq.cleanup expired entries) otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result (all bounded value sets, safe as Prometheus labels). All new panels follow the existing dashboard template: $node filter, exported_instance in every legend, Title Case, axis labels, row layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:37:29 +01:00
Pratik Mankawde	4174aef07b	fix(telemetry): align consensus_mode spanmetrics label with emitted attribute The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code emits the span attribute under the bare key `consensus_mode` (matching every other dimension after the Phase 6 rename). The mismatch left the `xrpl_consensus_mode` Prometheus label empty, so the Consensus Health "Consensus Mode Over Time" panel and the `$consensus_mode` template variable (which filters every panel) matched no live series. - otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode` - consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode` (the `$consensus_mode` Grafana variable name is unchanged) - telemetry-runbook.md: refresh the stale spanmetrics label table to the bare names actually emitted (command/rpc_status/consensus_mode/local/ proposal_trusted/validation_trusted), fix dotted->bare attribute names in span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id, consensus_ledger_id, consensus_round, tx_id event attr), correct the consensus_round_id query to int (not quoted string), and fix the load_type value query ("exception_rpc" -> "exceptioned RPC"). Verified against the live stack: Tempo span tags confirm bare attribute keys (consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode series in Prometheus is stale retained data from an older build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:29:45 +01:00
Pratik Mankawde	e6643a4389	updated tags Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 14:46:57 +01:00
Pratik Mankawde	80800ee130	use image-renderer in Grafana Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 14:40:35 +01:00
Pratik Mankawde	ebc5c5ed9d	fix(telemetry): set service_instance_id in [insight] so dashboards filter beast::insight metrics exported via OTLP carried no exported_instance label because [insight] omitted service_instance_id (only [telemetry] set it). Every system-* dashboard filters insight metrics with exported_instance=~"$node", and the $node template variable is sourced from label_values(..., exported_instance) — so with the label absent, $node was empty and all insight-backed panels showed no data. Add service_instance_id to [insight] in both telemetry configs, matching the [telemetry] value (xrpld-mainnet / xrpld-devnet). CollectorManager already reads this key and passes it to OTelCollector, which sets the service.instance.id resource attribute. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:36:04 +01:00
Pratik Mankawde	61c2760296	cosmetic updates Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 14:32:13 +01:00
Pratik Mankawde	88ac4b6aee	fix(telemetry): use short unit for NodeStore and object-count panels The phase-9 NodeStore I/O totals, write-load/read-queue, read-threads, and object instance-count panels rendered large cumulative values with unit "none". Switch to "short" for readable abbreviation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:27:53 +01:00
Pratik Mankawde	a5f80514a9	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:26:16 +01:00
Pratik Mankawde	90f7a8bd4e	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-04 14:26:16 +01:00
Pratik Mankawde	45ab508ed8	fix(telemetry): use short unit for large count/message panels Count and message-volume panels (operating-mode transitions, job queue depth, network/overlay message totals, getobject message counts) used unit "none", rendering large values as raw unscaled numbers. Switch to "short" so Grafana abbreviates (e.g. 1.5 Mil) for readability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:26:03 +01:00
Pratik Mankawde	a6cebf21b0	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill # Conflicts: # docker/telemetry/grafana/dashboards/system-node-health.json	2026-06-04 14:06:46 +01:00
Pratik Mankawde	6c71aa8c2a	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:05:25 +01:00
Pratik Mankawde	9b46a343fc	fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics The system-* dashboards queried the legacy StatsD rippled_ prefix, but the node now emits beast::insight metrics via native OTLP under the xrpld_ prefix (config: [insight] server=otel, prefix=xrpld). All queries returned no data. Migration (names derived from C++ beast::insight registrations, not live Prometheus, since a syncing node does not emit every metric yet): - rippled_ -> xrpld_ prefix across all panel queries and template variables (including the $node variable query, which broke the whole dashboard filter) - Histogram Event instruments export with unit ms, so bare _bucket becomes _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full - Job-type metrics were StatsD summaries (label quantile="$quantile"); on the OTLP path they are histograms. Converted those queries to histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m])) and added the previously-undefined $quantile template variable - Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or job types the syncing node has not run) show no data until the path executes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:01:13 +01:00
Pratik Mankawde	10b4112382	fix(telemetry): use p75/p99 quantiles and add gauge panels for job/rpc latency P100 from a histogram is degenerate — it always returns the upper bound of the highest populated bucket (a single slow outlier pins it to the top boundary), producing a flat line. Revert to meaningful quantiles: - Job Queue Wait Time / Job Execution Time: p75 (typical) + p99 (tail) - Per-Job-Type / Per-Method: p99 - Added gauge panels showing current p99 with green/yellow/red thresholds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 12:46:58 +01:00
Pratik Mankawde	859bd21ca5	only render p100. Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 12:16:48 +01:00
Pratik Mankawde	15d3e3a375	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 11:28:04 +01:00
Pratik Mankawde	0fe09cda9b	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 11:28:04 +01:00
Pratik Mankawde	a9cc1067d0	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-04 11:28:04 +01:00
Pratik Mankawde	194f5b8af8	fix(telemetry): set ms unit on duration heatmap y-axes The three duration heatmaps (transaction, consensus accept, RPC latency) had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick values rendered unscaled. Set unit=ms on both the yAxis options and panel defaults so buckets display as proper millisecond values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 11:27:46 +01:00
Pratik Mankawde	37c9168065	fix(telemetry): correct invalid 'us' unit code to 'µs' on duration panels Grafana does not recognize 'us' as a unit code, so microsecond values rendered as raw numbers with a plain 'us' suffix (no scaling). The correct code is 'µs'. Affects job-queue and OTel RPC latency panels backed by *_duration_us histograms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 11:26:43 +01:00
Pratik Mankawde	373012e84d	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-04 10:55:36 +01:00
Pratik Mankawde	8f9fa52f93	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:55:35 +01:00
Pratik Mankawde	fb7c3bc38d	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-06-04 10:55:27 +01:00
Pratik Mankawde	8e606bbaf4	feat(telemetry): add tx_type/ter_result/txq_status dashboard filters Adds template variables $tx_type, $ter_result, $txq_status to the Transaction Overview dashboard. All relevant panels now respect these filters, enabling operators to drill into specific transaction types or result codes. Changes: - Panel 2 renamed to "Transaction Processing Latency by Type" (now shows p95/p50 per tx_type instead of aggregate) - Panels 1,3,4,5,7,9,12 filter by $tx_type - Panel 10 filters by $tx_type and $ter_result - Panel 11 filters by $txq_status - Removed redundant "TX Processing Latency by Type (p95)" panel Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:55:11 +01:00


175 Commits
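
Note on commit 10b4112382 (p75/p99 instead of P100): the message says a histogram-derived P100 always lands on the upper bound of the highest populated bucket. The sketch below illustrates that with a simplified model of bucket interpolation. It is not Prometheus's actual histogram_quantile implementation: the function name, the finite-only bucket list, and the sample counts are assumptions made for illustration.

```python
from typing import List, Tuple

# Each bucket is (upper_bound_ms, cumulative_count), sorted by upper bound.
# Simplification (assumption): only finite buckets, lowest bucket starts at 0.
Bucket = Tuple[float, float]


def approx_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate a quantile by linear interpolation inside the bucket
    that contains the requested rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                # Empty bucket: nothing to interpolate over.
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Hypothetical data: 1000 samples, 990 of them at or below 10 ms,
# and a single outlier that falls in the (1000, 10000] bucket.
sample_buckets = [(10.0, 990.0), (100.0, 999.0), (1000.0, 999.0), (10000.0, 1000.0)]

p75 = approx_quantile(0.75, sample_buckets)
p99 = approx_quantile(0.99, sample_buckets)
p100 = approx_quantile(1.0, sample_buckets)
```

With this data, p100 evaluates to the 10000 ms bucket boundary no matter where inside that bucket the single outlier fell, while p75 and p99 stay at or below 10 ms. That is the flat-line behaviour the commit describes and the reason the panels moved to p75 and p99.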