rippled/docker/telemetry/grafana/dashboards/rippled-rpc-perf.json
Pratik Mankawde 5601615952 fix(telemetry): align Phase 9 dashboards and integration-test with xrpld_ metric prefix
MetricsRegistry emits OTel SDK metrics with the xrpld_ prefix
(MetricsRegistry.cpp defines "xrpld_nodestore_state",
"xrpld_cache_metrics", etc.), but the Phase 9 dashboards and the
Step 10c integration-test assertions introduced in 892fee638a
queried the rippled_ prefix. Every Phase 9 panel and assertion
therefore rendered "No data" or failed on a live run, even though
the underlying series were being exported correctly.

Rename the rippled_ prefix to xrpld_ for every MetricsRegistry
metric in dashboards and the integration test:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us_bucket
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement (ValidationTracker 1h/24h/7d windows)

Also add ValidationTracker window-gauge assertions to Step 10c of
integration-test.sh so the 1h/24h/7d agreement and miss counts are
checked alongside the other Phase 9 gauges.

The rippled_ prefix is preserved for beast::insight metrics
(rippled_LedgerMaster_*, rippled_Peer_Finder_*, rippled_total_*,
rippled_Overlay_*, rippled_State_Accounting_*, rippled_transactions_*,
rippled_proposals_*, rippled_validations_Messages_*) because those
flow through the StatsD-style OTelCollector configured with
`[insight] prefix=rippled` and remain on that prefix by design.
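
The split between the two prefixes can be sketched as a small rewrite rule. This is only an illustration, not the tooling used for this commit: `REGISTRY_STEMS` and `retarget_prefix` are hypothetical names, and the stem list is copied from the bullets above on the assumption that matching on those stems is enough to separate MetricsRegistry metrics from beast::insight ones.

```python
import re

# Metric-name stems emitted by MetricsRegistry, taken from the list in this
# commit message. Assumption: no beast::insight metric starts with one of
# these (their names begin with LedgerMaster_, Peer_Finder_, total_, etc.).
REGISTRY_STEMS = (
    "nodestore_state", "cache_metrics", "txq_metrics", "load_factor_metrics",
    "object_count", "rpc_method_", "job_queued_", "job_started_",
    "job_finished_", "job_running_", "peer_quality", "server_info",
    "validator_health", "ledger_economy", "db_metrics", "complete_ledgers",
    "build_info", "state_tracking", "ledgers_closed_total",
    "validations_sent_total", "validations_checked_total",
    "state_changes_total", "validation_agreement",
)

# Match "rippled_" only when it is immediately followed by a registry stem.
# The match is case-sensitive, so rippled_State_Accounting_* is left alone.
_PATTERN = re.compile(
    r"\brippled_(?=(?:" + "|".join(map(re.escape, REGISTRY_STEMS)) + r"))"
)


def retarget_prefix(expr: str) -> str:
    """Rewrite rippled_ to xrpld_ for MetricsRegistry metric names only."""
    return _PATTERN.sub("xrpld_", expr)
```

Applied to a PromQL expression, `retarget_prefix("rate(rippled_rpc_method_started_total[5m])")` yields the `xrpld_` form, while an expression that references `rippled_LedgerMaster_*` or `rippled_validations_Messages_*` passes through unchanged.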

Verified against a live 6-node consensus network: all 22 Phase 9 +
ValidationTracker assertions now report 6+ series per metric.
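
A "6+ series per metric" check of this kind could look roughly like the sketch below. It is not the code in integration-test.sh, which is not shown here; `series_count` and `meets_minimum` are hypothetical helpers, and the only thing assumed about Prometheus is the standard JSON body returned by an instant query (`status`, `data.resultType`, `data.result`). The sample payload is made up purely to exercise the helpers.

```python
from typing import Any


def series_count(response: dict[str, Any]) -> int:
    """Count the series in a Prometheus instant-query response body."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    # For a vector result, each element of data.result is one series.
    return len(response["data"]["result"])


def meets_minimum(response: dict[str, Any], minimum: int = 6) -> bool:
    """True when the query returned at least `minimum` series."""
    return series_count(response) >= minimum


# Illustrative payload only: node names and sample values are invented.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "xrpld_nodestore_state",
                    "exported_instance": f"node{i}",
                },
                "value": [0, "1"],
            }
            for i in range(1, 7)
        ],
    },
}
```

With one series per node, a six-node network satisfies `meets_minimum(sample)`; a metric still queried under the wrong prefix would return an empty `result` list and fail the same check.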

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:59:00 +01:00

405 lines
11 KiB
JSON

{
"annotations": {
"list": []
},
"description": "Per-RPC-method performance: call rates, error rates, and latency distributions. Sourced from OTel MetricsRegistry synchronous counters and histograms (Phase 9).",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "RPC Call Rate (All Methods)",
"description": "Per-node aggregate rate of RPC calls started, finished, and errored, summed across the selected methods. Computed as rate() over OTel counters with a 5m window.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(xrpld_rpc_method_started_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]))",
"legendFormat": "Started/s [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(xrpld_rpc_method_finished_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]))",
"legendFormat": "Finished/s [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (exported_instance) (rate(xrpld_rpc_method_errored_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]))",
"legendFormat": "Errored/s [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 10,
"axisLabel": "Operations / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
}
},
"overrides": []
}
},
{
"title": "Per-Method Call Rate (Top 10)",
"description": "Per-method RPC call rate, limited to the 10 most active method/node series at each evaluation step. Useful for identifying hot paths.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, rate(xrpld_rpc_method_started_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]))",
"legendFormat": "{{method}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"drawStyle": "line",
"lineWidth": 1,
"fillOpacity": 5,
"axisLabel": "Operations / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
}
},
"overrides": []
}
},
{
"title": "Per-Method Error Rate (Top 10)",
"description": "Per-method RPC error rate. Non-zero values warrant investigation. Common culprits: invalid parameters, resource exhaustion.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, rate(xrpld_rpc_method_errored_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]))",
"legendFormat": "{{method}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"drawStyle": "line",
"lineWidth": 1,
"fillOpacity": 5,
"axisLabel": "Operations / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
}
},
"overrides": []
}
},
{
"title": "RPC Latency (P50, P95, P99) - All Methods",
"description": "Histogram quantiles for RPC execution time across all methods. Sourced from the xrpld_rpc_method_duration_us histogram.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(xrpld_rpc_method_duration_us_bucket{exported_instance=~\"$node\", method=~\"$method\"}[5m])))",
"legendFormat": "P50 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(xrpld_rpc_method_duration_us_bucket{exported_instance=~\"$node\", method=~\"$method\"}[5m])))",
"legendFormat": "P95 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.99, sum by (le, exported_instance) (rate(xrpld_rpc_method_duration_us_bucket{exported_instance=~\"$node\", method=~\"$method\"}[5m])))",
"legendFormat": "P99 [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "µs",
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 5,
"axisLabel": "Duration (μs)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
}
},
"overrides": []
}
},
{
"title": "Per-Method Latency P95 (Top 10 Slowest)",
"description": "95th percentile execution time per method and node, limited to the 10 slowest series. Identifies the slowest RPC endpoints.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, histogram_quantile(0.95, sum by (le, method, exported_instance) (rate(xrpld_rpc_method_duration_us_bucket{exported_instance=~\"$node\", method=~\"$method\"}[5m]))))",
"legendFormat": "{{method}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "µs",
"custom": {
"drawStyle": "line",
"lineWidth": 1,
"fillOpacity": 5,
"axisLabel": "Duration (μs)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
}
},
"overrides": []
}
},
{
"title": "RPC Error Ratio by Method",
"description": "Error ratio (errors / total started) per method. Values above 0.05 (5%) warrant investigation.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, rate(xrpld_rpc_method_errored_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]) / (rate(xrpld_rpc_method_started_total{exported_instance=~\"$node\", method=~\"$method\"}[5m]) > 0))",
"legendFormat": "{{method}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"custom": {
"drawStyle": "line",
"lineWidth": 1,
"fillOpacity": 5,
"axisLabel": "Error Ratio",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
},
"color": {
"mode": "palette-classic"
},
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.05
},
{
"color": "red",
"value": 0.25
}
]
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "otel", "rpc"],
"templating": {
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
},
{
"name": "method",
"label": "RPC Method",
"description": "Filter by RPC method",
"type": "query",
"query": "label_values(xrpld_rpc_method_started_total, method)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "browser",
"title": "RPC Performance (OTel)",
"uid": "rippled-rpc-perf",
"version": 1
}