docs: add external dashboard parity tasks and metric reference for Phase 9

Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards),
new metric tables in data-collection-reference, and monitoring sections in runbook
covering validation agreement, validator health, peer quality, and state tracking.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pratik Mankawde
2026-03-30 15:37:45 +01:00
parent 6738f8b9ab
commit 92d109ce16
3 changed files with 336 additions and 9 deletions


@@ -910,19 +910,112 @@ rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_si
rippled_load_factor_metrics{metric="load_factor"} > 5
```
### Phase 7+: External Dashboard Parity Metrics
> **Source**: [External Dashboard Parity Spec](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — metrics inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
>
> **Task breakdown**: Phase 7 Tasks 7.9-7.16 (implementation), Phase 9 Tasks 9.11-9.13 (dashboards)
These metrics fill gaps identified by comparing rippled's internal observability with the community external dashboard's 86-metric coverage. All are exported via the OTel Metrics SDK (same `PeriodicMetricReader` as Phase 9 metrics).
#### Validation Agreement (Observable Gauge — `validation_agreement`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------- | ------ | -------- | --------------------------------------- |
| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Double | `metric` | Rolling 1h agreement percentage (0-100) |
| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Double | `metric` | Rolling 24h agreement percentage |
| `rippled_validation_agreement{metric="agreements_1h"}` | Int64 | `metric` | Agreed validations in 1h window |
| `rippled_validation_agreement{metric="missed_1h"}` | Int64 | `metric` | Missed validations in 1h window |
| `rippled_validation_agreement{metric="agreements_24h"}` | Int64 | `metric` | Agreed validations in 24h window |
| `rippled_validation_agreement{metric="missed_24h"}` | Int64 | `metric` | Missed validations in 24h window |
Data source: `ValidationTracker` class with 8s grace period and 5m late repair window.
#### Validator Health (Observable Gauge — `validator_health`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------ | ------ | -------- | ------------------------------ |
| `rippled_validator_health{metric="amendment_blocked"}` | Int64 | `metric` | 1 if amendment-blocked, else 0 |
| `rippled_validator_health{metric="unl_blocked"}` | Int64 | `metric` | 1 if UNL-blocked, else 0 |
| `rippled_validator_health{metric="unl_expiry_days"}` | Double | `metric` | Days until UNL list expires |
| `rippled_validator_health{metric="validation_quorum"}` | Int64 | `metric` | Validation quorum threshold |
#### Peer Quality (Observable Gauge — `peer_quality`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------------- | ------ | -------- | ------------------------------------ |
| `rippled_peer_quality{metric="peer_latency_p90_ms"}` | Double | `metric` | P90 peer latency in milliseconds |
| `rippled_peer_quality{metric="peers_insane_count"}` | Int64 | `metric` | Peers with diverged tracking status |
| `rippled_peer_quality{metric="peers_higher_version_pct"}` | Double | `metric` | % of peers on newer rippled version |
| `rippled_peer_quality{metric="upgrade_recommended"}` | Int64 | `metric` | 1 if >60% of peers are newer version |
#### Ledger Economy (Observable Gauge — `ledger_economy`)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------------------- | ------ | -------- | ---------------------------------- |
| `rippled_ledger_economy{metric="base_fee_xrp"}` | Double | `metric` | Base transaction fee in drops |
| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Double | `metric` | Account reserve in drops |
| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Double | `metric` | Owner reserve increment in drops |
| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Double | `metric` | Seconds since last validated close |
| `rippled_ledger_economy{metric="transaction_rate"}` | Double | `metric` | Smoothed transaction rate (tx/s) |
#### State Tracking (Observable Gauge — `state_tracking`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------------- | ------ | -------- | -------------------------------------- |
| `rippled_state_tracking{metric="state_value"}` | Int64 | `metric` | Numeric state 0-6 (see encoding below) |
| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Double | `metric` | Duration in current state |
State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (FULL + validating), 6=proposing (FULL + proposing).
#### Storage Detail (Observable Gauge — `storage_detail`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------- | ----- | -------- | ---------------------- |
| `rippled_storage_detail{metric="nudb_bytes"}` | Int64 | `metric` | NuDB backend file size |
#### Synchronous Counters (Phase 7+)
| Prometheus Metric | Type | Description | Increment Site |
| ------------------------------------- | ------- | -------------------------------- | --------------------- |
| `rippled_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `rippled_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `rippled_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `rippled_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp |
| `rippled_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp |
| `rippled_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `rippled_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
#### Span Attribute Enrichments (Phases 2-4)
| Span Name | New Attribute | Type | Source |
| --------------------------- | ------------------------------------ | ------ | ------------------------ |
| `rpc.command.*` | `xrpl.node.amendment_blocked` | bool | Phase 2 — RPCHandler.cpp |
| `rpc.command.*` | `xrpl.node.server_state` | string | Phase 2 — RPCHandler.cpp |
| `tx.receive` | `xrpl.peer.version` | string | Phase 3 — PeerImp.cpp |
| `consensus.validation.send` | `xrpl.validation.ledger_hash` | string | Phase 4 — RCLConsensus |
| `consensus.validation.send` | `xrpl.validation.full` | bool | Phase 4 — RCLConsensus |
| `peer.validation.receive` | `xrpl.peer.validation.ledger_hash` | string | Phase 4 — PeerImp.cpp |
| `peer.validation.receive` | `xrpl.peer.validation.full` | bool | Phase 4 — PeerImp.cpp |
| `consensus.accept` | `xrpl.consensus.validation_quorum` | int64 | Phase 4 — RCLConsensus |
| `consensus.accept` | `xrpl.consensus.proposers_validated` | int64 | Phase 4 — RCLConsensus |
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | -------------------------- | ----------- | --------------------------------------------------------- |
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | `rippled-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions |
| Validator Health | `rippled-validator-health` | Prometheus | Agreement %, validation rate, amendment/UNL, state |
| Peer Quality | `rippled-peer-quality` | Prometheus | P90 latency, insane peers, version awareness, disconnects |
### Updated Grafana Dashboards (Phase 9)
| Dashboard | UID | New Panels Added |
| -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| Node Health (StatsD) | `rippled-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts |
| System Node Health | `rippled-system-node-health` | Ledger economy row: base fee, reserves, ledger age, transaction rate |
### New Grafana Dashboards (Phase 11)


@@ -342,6 +342,96 @@ These metrics serve multiple external consumer categories identified during rese
---
## Task 9.11: Validator Health Dashboard (External Dashboard Parity)
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — dashboards for Phase 7 metrics inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
>
> **Upstream**: Phase 7 Tasks 7.9-7.16 (metrics must be emitting).
> **Downstream**: Phase 10 (dashboard load checks), Phase 11 (alert rules reference these panels).
**Objective**: Create a Grafana dashboard for validation agreement, amendment/UNL health, and state tracking.
**Dashboard**: `rippled-validator-health.json`
| Panel | Type | PromQL |
| -------------------------- | ---------- | ---------------------------------------------------------------- |
| Agreement % (1h) | stat | `rippled_validation_agreement{metric="agreement_pct_1h"}` |
| Agreement % (24h) | stat | `rippled_validation_agreement{metric="agreement_pct_24h"}` |
| Agreements vs Missed (1h) | bargauge | `agreements_1h` and `missed_1h` side by side |
| Agreements vs Missed (24h) | bargauge | `agreements_24h` and `missed_24h` side by side |
| Validation Rate | stat | `rate(rippled_validations_sent_total[5m]) * 60` |
| Validations Checked Rate | stat | `rate(rippled_validations_checked_total[5m]) * 60` |
| Amendment Blocked | stat | `rippled_validator_health{metric="amendment_blocked"}` |
| UNL Expiry (days) | stat | `rippled_validator_health{metric="unl_expiry_days"}` |
| Validation Quorum | stat | `rippled_validator_health{metric="validation_quorum"}` |
| State Value Timeline | timeseries | `rippled_state_tracking{metric="state_value"}` |
| Time in Current State | stat | `rippled_state_tracking{metric="time_in_current_state_seconds"}` |
| State Changes Rate | stat | `rate(rippled_state_changes_total[1h])` |
| Ledgers Closed Rate | stat | `rate(rippled_ledgers_closed_total[5m]) * 60` |
**Dashboard conventions**: `$node` template variable for `exported_instance` filtering, dark theme, matching existing panel sizes and color schemes.
**Key new files**: `docker/telemetry/grafana/dashboards/rippled-validator-health.json`
**Exit Criteria**:
- [ ] All 13 panels render with data during normal operation (flag panels such as Amendment Blocked are expected to read 0 on a healthy node)
- [ ] `$node` filter works correctly for multi-node deployments
- [ ] Amendment blocked and UNL expiry panels use color thresholds (red=blocked/expiring)
---
## Task 9.12: Peer Quality Dashboard (External Dashboard Parity)
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md)
**Objective**: Create a Grafana dashboard for peer health aggregates.
**Dashboard**: `rippled-peer-quality.json`
| Panel | Type | PromQL |
| ---------------------- | ---------- | ---------------------------------------------------------------- |
| P90 Peer Latency | timeseries | `rippled_peer_quality{metric="peer_latency_p90_ms"}` |
| Insane/Diverged Peers | stat | `rippled_peer_quality{metric="peers_insane_count"}` |
| Higher Version Peers % | stat | `rippled_peer_quality{metric="peers_higher_version_pct"}` |
| Upgrade Recommended | stat | `rippled_peer_quality{metric="upgrade_recommended"}` |
| Resource Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects_Charges` |
| Inbound vs Outbound | bargauge | `rippled_Peer_Finder_Active_Inbound_Peers`, `..._Outbound_Peers` |
**Key new files**: `docker/telemetry/grafana/dashboards/rippled-peer-quality.json`
**Exit Criteria**:
- [ ] All 6 panels render correctly
- [ ] P90 latency panel shows trend over time
- [ ] Upgrade recommended panel uses color threshold (red=1, green=0)
---
## Task 9.13: Ledger Economy Dashboard Panels (External Dashboard Parity)
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md)
**Objective**: Add "Ledger Economy" row to the existing `system-node-health.json` dashboard.
| Panel | Type | PromQL |
| -------------------- | ---------- | ----------------------------------------------------- |
| Base Fee (drops) | stat | `rippled_ledger_economy{metric="base_fee_xrp"}` |
| Reserve Base (drops) | stat | `rippled_ledger_economy{metric="reserve_base_xrp"}` |
| Reserve Inc (drops) | stat | `rippled_ledger_economy{metric="reserve_inc_xrp"}` |
| Ledger Age | stat | `rippled_ledger_economy{metric="ledger_age_seconds"}` |
| Transaction Rate | timeseries | `rippled_ledger_economy{metric="transaction_rate"}` |
**Key modified files**: `docker/telemetry/grafana/dashboards/system-node-health.json`
**Exit Criteria**:
- [ ] 5 new panels render correctly in existing dashboard
- [ ] Fee values match `server_info` RPC output
- [ ] Transaction rate shows smooth trend (not spiky)
---
## Exit Criteria
- [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline
@@ -352,3 +442,6 @@ These metrics serve multiple external consumer categories identified during rese
- [ ] Integration test validates all new metric families are non-zero
- [ ] No performance regression (< 0.5% CPU overhead from new callbacks)
- [ ] Documentation updated with full new metric inventory
- [ ] Validator Health dashboard renders all 13 panels
- [ ] Peer Quality dashboard renders all 6 panels
- [ ] Ledger Economy panels added to system-node-health dashboard


@@ -257,7 +257,7 @@ These gauges are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s i
## Grafana Dashboards
Fifteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
### RPC Performance (`rippled-rpc-perf`)
@@ -484,6 +484,147 @@ Add to your Prometheus alerting rules configuration.
| `JobQueueBacklog` | Warning | `sum(rate(rippled_job_queued_total[5m])) - sum(rate(rippled_job_finished_total[5m])) > 100` | 5m | Jobs are being queued faster than they're completing |
| `SlowJobExecution` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_job_running_duration_us_bucket[5m]))) > 10000000` | 5m | Job execution p95 exceeds 10 seconds |
## Validator Health Monitoring (Phase 7+)
Phase 7 introduces native metrics for validator health, validation agreement, peer quality, ledger economy, and state tracking — inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). These metrics are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s interval).
### Validation Agreement
The `ValidationTracker` class computes rolling validation agreement between this node and network consensus. It maintains 1h and 24h sliding windows with an 8-second grace period and 5-minute late repair window.
| Prometheus Metric | Description |
| ---------------------------------------------------------- | ------------------------------ |
| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Agreement % over last 1 hour |
| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Agreement % over last 24 hours |
| `rippled_validation_agreement{metric="agreements_1h"}` | Agreed validations in 1h |
| `rippled_validation_agreement{metric="missed_1h"}` | Missed validations in 1h |
| `rippled_validation_agreement{metric="agreements_24h"}` | Agreed validations in 24h |
| `rippled_validation_agreement{metric="missed_24h"}` | Missed validations in 24h |
| `rippled_validations_sent_total` | Total validations sent |
| `rippled_validations_checked_total` | Total network validations seen |
| `rippled_validation_agreements_total` | Cumulative agreements |
| `rippled_validation_missed_total` | Cumulative misses |
**How reconciliation works**:
1. When the node sends a validation for ledger X, the tracker records `weValidated=true`
2. When the network validates a ledger, the tracker records `networkValidated=true`
3. After an 8-second grace period, the tracker reconciles: if both are true for the same ledger hash, it's an agreement; otherwise, a miss
4. If a late validation arrives within 5 minutes, a previous miss can be corrected (late repair)
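The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the rippled `ValidationTracker` implementation: the class name `ValidationTrackerSketch`, its method names, and the choice to measure both the grace period and the late repair window from the first time a ledger hash is seen are all assumptions made for the example. Only the 8-second and 5-minute values come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

GRACE_SECONDS = 8.0          # grace period stated in the text
LATE_REPAIR_SECONDS = 300.0  # 5-minute late repair window stated in the text


@dataclass
class LedgerRecord:
    first_seen: float                # assumption: both windows start here
    we_validated: bool = False
    network_validated: bool = False
    outcome: Optional[str] = None    # None until reconciled, then "agreement" or "miss"


class ValidationTrackerSketch:
    """Hypothetical stand-in for illustration; not the rippled class."""

    def __init__(self) -> None:
        self.records: Dict[str, LedgerRecord] = {}
        self.agreements = 0
        self.missed = 0

    def _record(self, ledger_hash: str, now: float) -> LedgerRecord:
        return self.records.setdefault(ledger_hash, LedgerRecord(first_seen=now))

    def on_we_validated(self, ledger_hash: str, now: float) -> None:
        rec = self._record(ledger_hash, now)
        rec.we_validated = True
        self._late_repair(rec, now)

    def on_network_validated(self, ledger_hash: str, now: float) -> None:
        rec = self._record(ledger_hash, now)
        rec.network_validated = True
        self._late_repair(rec, now)

    def _late_repair(self, rec: LedgerRecord, now: float) -> None:
        # Step 4: a recorded miss becomes an agreement if the other half arrives in time.
        if (rec.outcome == "miss"
                and rec.we_validated and rec.network_validated
                and now - rec.first_seen <= LATE_REPAIR_SECONDS):
            rec.outcome = "agreement"
            self.missed -= 1
            self.agreements += 1

    def reconcile(self, now: float) -> None:
        # Step 3: classify every record whose grace period has elapsed.
        for rec in self.records.values():
            if rec.outcome is None and now - rec.first_seen >= GRACE_SECONDS:
                if rec.we_validated and rec.network_validated:
                    rec.outcome = "agreement"
                    self.agreements += 1
                else:
                    rec.outcome = "miss"
                    self.missed += 1

    def agreement_pct(self) -> float:
        total = self.agreements + self.missed
        # Returning 0.0 for an empty window is an arbitrary choice for this sketch.
        return 100.0 * self.agreements / total if total else 0.0
```

The sketch keeps lifetime totals only; the real gauges additionally scope the counts to 1h and 24h sliding windows, which would require timestamping each outcome and expiring old ones.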
**When to worry**: Agreement below 90% over 24h indicates the node is missing network consensus — check connectivity, clock sync, and whether the node is in `Full` mode.
```promql
# Agreement percentage over 24 hours
rippled_validation_agreement{metric="agreement_pct_24h"}
# Validations sent per minute (expect ~12-20, i.e. one per 3-5s ledger, during normal operation)
rate(rippled_validations_sent_total[5m]) * 60
# Ratio of agreements to total reconciled
rippled_validation_agreements_total / (rippled_validation_agreements_total + rippled_validation_missed_total)
```
### Validator Health Gauges
| Prometheus Metric | Description | Healthy Value |
| ------------------------------------------------------ | ----------------------------------- | ----------------------- |
| `rippled_validator_health{metric="amendment_blocked"}` | 1 if amendment-blocked, 0 if not | 0 |
| `rippled_validator_health{metric="unl_blocked"}` | 1 if UNL-blocked, 0 if not | 0 |
| `rippled_validator_health{metric="unl_expiry_days"}` | Days until UNL list expires | > 14 |
| `rippled_validator_health{metric="validation_quorum"}` | Current validation quorum threshold | Network-dependent (~28) |
```promql
# Alert if amendment blocked
rippled_validator_health{metric="amendment_blocked"} == 1
# Alert if UNL expiring within 14 days
rippled_validator_health{metric="unl_expiry_days"} < 14
```
### Peer Quality Monitoring
| Prometheus Metric | Description |
| --------------------------------------------------------- | --------------------------------------- |
| `rippled_peer_quality{metric="peer_latency_p90_ms"}` | P90 peer latency in milliseconds |
| `rippled_peer_quality{metric="peers_insane_count"}` | Peers with diverged/insane tracking |
| `rippled_peer_quality{metric="peers_higher_version_pct"}` | % of peers running a newer version |
| `rippled_peer_quality{metric="upgrade_recommended"}` | 1 if >60% of peers are on newer version |
| `rippled_Overlay_Peer_Disconnects_Charges` | Disconnects due to resource charges |
**Key insight**: If `upgrade_recommended` is 1, more than 60% of connected peers are running a newer rippled version than this node. Nothing breaks immediately, but a node that does not support a newly activated amendment becomes amendment-blocked, so plan the upgrade before that happens.
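To make the 60% rule concrete, here is a small Python sketch of how `peers_higher_version_pct` and `upgrade_recommended` could be derived from a list of peer version strings. The function names, the plain `major.minor.patch` parsing, and the handling of an empty peer list are assumptions for illustration; the source only states the threshold.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]
UPGRADE_THRESHOLD_PCT = 60.0  # ">60% of peers" threshold from the metric description


def parse_version(text: str) -> Version:
    # Simplification: keep only the numeric "major.minor.patch" core and
    # ignore any suffix such as "-rc1" or "-b2".
    core = text.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def peers_higher_version_pct(own: str, peers: Iterable[str]) -> float:
    own_version = parse_version(own)
    versions = [parse_version(p) for p in peers]
    if not versions:
        return 0.0  # assumption: no peers means nothing to compare against
    higher = sum(1 for v in versions if v > own_version)
    return 100.0 * higher / len(versions)


def upgrade_recommended(pct: float) -> int:
    # Strictly greater than the threshold, matching the ">60%" wording.
    return 1 if pct > UPGRADE_THRESHOLD_PCT else 0
```

With the strict comparison, a node with exactly 60% of its peers on a newer version still reports 0.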
```promql
# P90 peer latency trend
rippled_peer_quality{metric="peer_latency_p90_ms"}
# Correlate high latency with disconnects
rippled_peer_quality{metric="peer_latency_p90_ms"} > 500
and rate(rippled_Overlay_Peer_Disconnects_Charges[5m]) > 0
```
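For readers unfamiliar with percentile gauges, the sketch below shows one common way to compute a P90 from per-peer latency samples, using the nearest-rank method. The source does not say which percentile method rippled uses, so the method, the function name `p90_latency_ms`, and the empty-input behaviour are assumptions.

```python
from typing import Sequence


def p90_latency_ms(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 90th percentile of peer latencies, in milliseconds."""
    if not latencies_ms:
        return 0.0  # assumption: report 0 when there are no peers
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Integer form of ceil(0.9 * n), which avoids floating-point rounding.
    rank = (9 * n + 9) // 10
    return float(ordered[rank - 1])
```

A P90 hides the slowest 10% of peers, so a single distant peer does not move the gauge, while a broad slowdown does.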
### Ledger Economy Monitoring
| Prometheus Metric | Description |
| ----------------------------------------------------- | ---------------------------------- |
| `rippled_ledger_economy{metric="base_fee_xrp"}` | Base fee in drops |
| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Account reserve in drops |
| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Owner reserve increment in drops |
| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Seconds since last validated close |
| `rippled_ledger_economy{metric="transaction_rate"}` | Smoothed transaction rate |
| `rippled_ledgers_closed_total` | Total ledgers closed |
```promql
# Fee values (should match server_info output)
rippled_ledger_economy{metric="base_fee_xrp"}
# Ledger age — should reset to ~0 every 3-5s
rippled_ledger_economy{metric="ledger_age_seconds"}
# Ledger close rate (should be ~12-20 per minute)
rate(rippled_ledgers_closed_total[5m]) * 60
```
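When checking the fee gauges against `server_info`, keep the units in mind: the descriptions above give values in drops, while `server_info` reports its `*_xrp` fields in XRP (1 XRP = 1,000,000 drops). The helper names below are hypothetical; this is only a sketch of the comparison, assuming the gauge value really is in drops as the tables state.

```python
from decimal import Decimal

DROPS_PER_XRP = Decimal(1_000_000)  # 1 XRP = 1,000,000 drops


def drops_to_xrp(drops: int) -> Decimal:
    # Decimal keeps the conversion exact; floats would introduce rounding noise.
    return Decimal(drops) / DROPS_PER_XRP


def matches_server_info(gauge_drops: int, server_info_xrp: str) -> bool:
    # server_info_xrp is the XRP amount as text, e.g. "0.00001" or "1e-05".
    return drops_to_xrp(gauge_drops) == Decimal(server_info_xrp)
```

For example, a base fee of 10 drops corresponds to 0.00001 XRP.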
### State Tracking
| Prometheus Metric | Description |
| ---------------------------------------------------------------- | ------------------------------ |
| `rippled_state_tracking{metric="state_value"}` | Numeric state (0-6, see table) |
| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Duration in current state |
| `rippled_state_changes_total` | Total state transitions |
**State value encoding**:
| Value | State | Meaning |
| ----- | ------------ | ---------------------------------------------------- |
| 0 | disconnected | No network connectivity |
| 1 | connected | Connected but not syncing |
| 2 | syncing | Fetching ledger history |
| 3 | tracking | Following network but not fully validated |
| 4 | full | Fully synced, not validating |
| 5 | validating | Fully synced and validating |
| 6 | proposing | Fully synced, validating, and proposing in consensus |
Values 5-6 combine `OperatingMode` (0-4) with `ConsensusMode` (validating/proposing) to give a richer picture of node participation.
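The encoding can be expressed as a small mapping function. The Python sketch below mirrors the table; the enum, the function name `state_value`, the boolean flags, and the rule that proposing takes precedence over validating are assumptions for illustration, not the rippled implementation.

```python
from enum import IntEnum


class OperatingMode(IntEnum):
    DISCONNECTED = 0
    CONNECTED = 1
    SYNCING = 2
    TRACKING = 3
    FULL = 4


def state_value(mode: OperatingMode, validating: bool = False, proposing: bool = False) -> int:
    # Values 5 and 6 only apply once the node is fully synced.
    if mode is OperatingMode.FULL:
        if proposing:
            return 6
        if validating:
            return 5
    # Otherwise the gauge is just the operating mode's numeric value.
    return int(mode)
```

Because the scale is ordinal, a drop on the state timeline always means reduced participation, which is what makes a single numeric gauge convenient for alerting.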
```promql
# State timeline (should stay at 5 or 6 for validators)
rippled_state_tracking{metric="state_value"}
# Alert on frequent state changes (flapping): more than 2 transitions in the past hour
increase(rippled_state_changes_total[1h]) > 2
```
### Grafana Dashboards (Phase 9)
| Dashboard | UID | Panels | Key Metrics |
| ------------------ | -------------------------- | ------ | --------------------------------------------------------- |
| Validator Health | `rippled-validator-health` | 13 | Agreement %, validation rate, amendment/UNL health, state |
| Peer Quality | `rippled-peer-quality` | 6 | P90 latency, insane peers, version awareness |
| System Node Health | (updated) | +5 | Ledger economy row: fee, reserves, age, tx rate |
---
## Troubleshooting
### No OTel SDK metrics in Prometheus