diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index 4f69ccfab8..e3869b0242 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -910,19 +910,112 @@ rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_si rippled_load_factor_metrics{metric="load_factor"} > 5 ``` +### Phase 7+: External Dashboard Parity Metrics + +> **Source**: [External Dashboard Parity Spec](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — metrics inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). +> +> **Task breakdown**: Phase 7 Tasks 7.9-7.16 (implementation), Phase 9 Tasks 9.11-9.13 (dashboards) + +These metrics fill gaps identified by comparing rippled's internal observability with the community external dashboard's 86-metric coverage. All are exported via the OTel Metrics SDK (same `PeriodicMetricReader` as Phase 9 metrics). 
+ +#### Validation Agreement (Observable Gauge — `validation_agreement`) + +| Prometheus Metric | Type | Labels | Description | +| ---------------------------------------------------------- | ------ | -------- | --------------------------------------- | +| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Double | `metric` | Rolling 1h agreement percentage (0-100) | +| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Double | `metric` | Rolling 24h agreement percentage | +| `rippled_validation_agreement{metric="agreements_1h"}` | Int64 | `metric` | Agreed validations in 1h window | +| `rippled_validation_agreement{metric="missed_1h"}` | Int64 | `metric` | Missed validations in 1h window | +| `rippled_validation_agreement{metric="agreements_24h"}` | Int64 | `metric` | Agreed validations in 24h window | +| `rippled_validation_agreement{metric="missed_24h"}` | Int64 | `metric` | Missed validations in 24h window | + +Data source: `ValidationTracker` class with 8s grace period and 5m late repair window. 
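The grace period and late repair window can be read as a small per-ledger state machine. The sketch below is a minimal Python model of that idea, following the reconciliation steps in the runbook changes later in this diff. It is not rippled's C++ `ValidationTracker`: the class name `ValidationTrackerSketch`, its method names, and the choice to time both windows from the first event seen for a ledger hash are assumptions made for illustration. The rolling 1h/24h windows are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional

GRACE_SECONDS = 8.0          # grace period before a ledger is reconciled
LATE_REPAIR_SECONDS = 300.0  # window in which a miss can still be corrected


@dataclass
class LedgerRecord:
    first_seen: float                 # time of the first event for this ledger hash
    we_validated: bool = False        # this node sent a validation
    network_validated: bool = False   # the network validated this ledger
    outcome: Optional[str] = None     # None until reconciled, then "agreement" or "miss"


class ValidationTrackerSketch:
    """Hypothetical model only; names do not come from rippled."""

    def __init__(self) -> None:
        self.records: Dict[str, LedgerRecord] = {}

    def _touch(self, ledger_hash: str, now: float) -> LedgerRecord:
        return self.records.setdefault(ledger_hash, LedgerRecord(first_seen=now))

    def on_validation_sent(self, ledger_hash: str, now: float) -> None:
        rec = self._touch(ledger_hash, now)
        rec.we_validated = True
        self._late_repair(rec, now)

    def on_network_validated(self, ledger_hash: str, now: float) -> None:
        rec = self._touch(ledger_hash, now)
        rec.network_validated = True
        self._late_repair(rec, now)

    def _late_repair(self, rec: LedgerRecord, now: float) -> None:
        # A recorded miss becomes an agreement if the missing half arrives in time.
        in_window = now - rec.first_seen <= LATE_REPAIR_SECONDS
        if rec.outcome == "miss" and rec.we_validated and rec.network_validated and in_window:
            rec.outcome = "agreement"

    def reconcile(self, now: float) -> None:
        # Classify every ledger whose grace period has elapsed.
        for rec in self.records.values():
            if rec.outcome is None and now - rec.first_seen >= GRACE_SECONDS:
                both = rec.we_validated and rec.network_validated
                rec.outcome = "agreement" if both else "miss"

    def agreement_pct(self) -> float:
        outcomes = [r.outcome for r in self.records.values() if r.outcome is not None]
        if not outcomes:
            return 0.0  # assumption: report 0 when nothing has been reconciled yet
        return 100.0 * outcomes.count("agreement") / len(outcomes)


tracker = ValidationTrackerSketch()
tracker.on_validation_sent("A", now=0.0)
tracker.on_network_validated("A", now=2.0)
tracker.on_network_validated("B", now=4.0)   # this node has not validated B yet
tracker.reconcile(now=20.0)                  # A is an agreement, B is a miss
pct_before_repair = tracker.agreement_pct()
tracker.on_validation_sent("B", now=60.0)    # late, but inside the repair window
pct_after_repair = tracker.agreement_pct()
```

In this model, ledger B is first counted as a miss and then repaired once the node's own validation arrives inside the 5-minute window. A validation arriving after that window would leave B counted as a miss.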
+ +#### Validator Health (Observable Gauge — `validator_health`) + +| Prometheus Metric | Type | Labels | Description | +| ------------------------------------------------------ | ------ | -------- | ------------------------------ | +| `rippled_validator_health{metric="amendment_blocked"}` | Int64 | `metric` | 1 if amendment-blocked, else 0 | +| `rippled_validator_health{metric="unl_blocked"}` | Int64 | `metric` | 1 if UNL-blocked, else 0 | +| `rippled_validator_health{metric="unl_expiry_days"}` | Double | `metric` | Days until UNL list expires | +| `rippled_validator_health{metric="validation_quorum"}` | Int64 | `metric` | Validation quorum threshold | + +#### Peer Quality (Observable Gauge — `peer_quality`) + +| Prometheus Metric | Type | Labels | Description | +| --------------------------------------------------------- | ------ | -------- | ------------------------------------ | +| `rippled_peer_quality{metric="peer_latency_p90_ms"}` | Double | `metric` | P90 peer latency in milliseconds | +| `rippled_peer_quality{metric="peers_insane_count"}` | Int64 | `metric` | Peers with diverged tracking status | +| `rippled_peer_quality{metric="peers_higher_version_pct"}` | Double | `metric` | % of peers on newer rippled version | +| `rippled_peer_quality{metric="upgrade_recommended"}` | Int64 | `metric` | 1 if >60% of peers are newer version | + +#### Ledger Economy (Observable Gauge — `ledger_economy`) + +| Prometheus Metric | Type | Labels | Description | +| ----------------------------------------------------- | ------ | -------- | ---------------------------------- | +| `rippled_ledger_economy{metric="base_fee_xrp"}` | Double | `metric` | Base transaction fee in drops | +| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Double | `metric` | Account reserve in drops | +| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Double | `metric` | Owner reserve increment in drops | +| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Double | `metric` | Seconds 
since last validated close | +| `rippled_ledger_economy{metric="transaction_rate"}` | Double | `metric` | Smoothed transaction rate (tx/s) | + +#### State Tracking (Observable Gauge — `state_tracking`) + +| Prometheus Metric | Type | Labels | Description | +| ---------------------------------------------------------------- | ------ | -------- | -------------------------------------- | +| `rippled_state_tracking{metric="state_value"}` | Int64 | `metric` | Numeric state 0-6 (see encoding below) | +| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Double | `metric` | Duration in current state | + +State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (FULL + validating), 6=proposing (FULL + proposing). + +#### Storage Detail (Observable Gauge — `storage_detail`) + +| Prometheus Metric | Type | Labels | Description | +| --------------------------------------------- | ----- | -------- | ---------------------- | +| `rippled_storage_detail{metric="nudb_bytes"}` | Int64 | `metric` | NuDB backend file size | + +#### Synchronous Counters (Phase 7+) + +| Prometheus Metric | Type | Description | Increment Site | +| ------------------------------------- | ------- | -------------------------------- | --------------------- | +| `rippled_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp | +| `rippled_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp | +| `rippled_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp | +| `rippled_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp | +| `rippled_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp | +| `rippled_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp | +| `rippled_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp | + +#### Span Attribute 
Enrichments (Phases 2-4) + +| Span Name | New Attribute | Type | Source | +| --------------------------- | ------------------------------------ | ------ | ------------------------ | +| `rpc.command.*` | `xrpl.node.amendment_blocked` | bool | Phase 2 — RPCHandler.cpp | +| `rpc.command.*` | `xrpl.node.server_state` | string | Phase 2 — RPCHandler.cpp | +| `tx.receive` | `xrpl.peer.version` | string | Phase 3 — PeerImp.cpp | +| `consensus.validation.send` | `xrpl.validation.ledger_hash` | string | Phase 4 — RCLConsensus | +| `consensus.validation.send` | `xrpl.validation.full` | bool | Phase 4 — RCLConsensus | +| `peer.validation.receive` | `xrpl.peer.validation.ledger_hash` | string | Phase 4 — PeerImp.cpp | +| `peer.validation.receive` | `xrpl.peer.validation.full` | bool | Phase 4 — PeerImp.cpp | +| `consensus.accept` | `xrpl.consensus.validation_quorum` | int64 | Phase 4 — RCLConsensus | +| `consensus.accept` | `xrpl.consensus.proposers_validated` | int64 | Phase 4 — RCLConsensus | + ### New Grafana Dashboards (Phase 9) -| Dashboard | UID | Data Source | Key Panels | -| ---------------------- | -------------------- | ----------- | --------------------------------------------------------- | -| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown | -| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times | -| RPC Performance (OTel) | `rippled-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions | +| Dashboard | UID | Data Source | Key Panels | +| ---------------------- | -------------------------- | ----------- | --------------------------------------------------------- | +| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown | +| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times | +| RPC Performance (OTel) | 
`rippled-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions | +| Validator Health | `rippled-validator-health` | Prometheus | Agreement %, validation rate, amendment/UNL, state | +| Peer Quality | `rippled-peer-quality` | Prometheus | P90 latency, insane peers, version awareness, disconnects | ### Updated Grafana Dashboards (Phase 9) -| Dashboard | UID | New Panels Added | -| -------------------- | ---------------------------- | ------------------------------------------------------ | -| Node Health (StatsD) | `rippled-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts | +| Dashboard | UID | New Panels Added | +| -------------------- | ---------------------------- | -------------------------------------------------------------------- | +| Node Health (StatsD) | `rippled-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts | +| System Node Health | `rippled-system-node-health` | Ledger economy row: base fee, reserves, ledger age, transaction rate | ### New Grafana Dashboards (Phase 11) diff --git a/OpenTelemetryPlan/Phase9_taskList.md b/OpenTelemetryPlan/Phase9_taskList.md index 69af4d9263..2ede785bd0 100644 --- a/OpenTelemetryPlan/Phase9_taskList.md +++ b/OpenTelemetryPlan/Phase9_taskList.md @@ -342,6 +342,96 @@ These metrics serve multiple external consumer categories identified during rese --- +## Task 9.11: Validator Health Dashboard (External Dashboard Parity) + +> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — dashboards for Phase 7 metrics inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). +> +> **Upstream**: Phase 7 Tasks 7.9-7.16 (metrics must be emitting). +> **Downstream**: Phase 10 (dashboard load checks), Phase 11 (alert rules reference these panels). 
+ +**Objective**: Create a Grafana dashboard for validation agreement, amendment/UNL health, and state tracking. + +**Dashboard**: `rippled-validator-health.json` + +| Panel | Type | PromQL | +| -------------------------- | ---------- | ---------------------------------------------------------------- | +| Agreement % (1h) | stat | `rippled_validation_agreement{metric="agreement_pct_1h"}` | +| Agreement % (24h) | stat | `rippled_validation_agreement{metric="agreement_pct_24h"}` | +| Agreements vs Missed (1h) | bargauge | `agreements_1h` and `missed_1h` side by side | +| Agreements vs Missed (24h) | bargauge | `agreements_24h` and `missed_24h` side by side | +| Validation Rate | stat | `rate(rippled_validations_sent_total[5m]) * 60` | +| Validations Checked Rate | stat | `rate(rippled_validations_checked_total[5m]) * 60` | +| Amendment Blocked | stat | `rippled_validator_health{metric="amendment_blocked"}` | +| UNL Expiry (days) | stat | `rippled_validator_health{metric="unl_expiry_days"}` | +| Validation Quorum | stat | `rippled_validator_health{metric="validation_quorum"}` | +| State Value Timeline | timeseries | `rippled_state_tracking{metric="state_value"}` | +| Time in Current State | stat | `rippled_state_tracking{metric="time_in_current_state_seconds"}` | +| State Changes Rate | stat | `rate(rippled_state_changes_total[1h])` | +| Ledgers Closed Rate | stat | `rate(rippled_ledgers_closed_total[5m]) * 60` | + +**Dashboard conventions**: `$node` template variable for `exported_instance` filtering, dark theme, matching existing panel sizes and color schemes. 
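The `$node` convention could be applied to a panel roughly as follows. This is a minimal Python sketch that builds one stat panel as a dictionary shaped like Grafana's dashboard JSON. The helper name `stat_panel`, the panel id, and the regex matcher `exported_instance=~"$node"` are assumptions for illustration, and the red-at-1 threshold is one plausible choice for a 0/1 flag, not a copy of the dashboard file this task produces.

```python
import json


def stat_panel(title: str, family: str, sub_metric: str, panel_id: int) -> dict:
    """Build a minimal stat panel whose query is filtered by the $node variable."""
    # Doubled braces are literal braces in the f-string; $node is left for Grafana to expand.
    expr = f'{family}{{metric="{sub_metric}", exported_instance=~"$node"}}'
    return {
        "id": panel_id,
        "type": "stat",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step (0 = healthy)
                        {"color": "red", "value": 1},       # 1 = blocked
                    ],
                }
            }
        },
    }


panel = stat_panel(
    "Amendment Blocked", "rippled_validator_health", "amendment_blocked", panel_id=7
)
print(json.dumps(panel, indent=2))
```

Generating panels from one helper keeps the `$node` filter and threshold colors uniform across the 13 panels. The datasource reference is left out here because its form depends on how the Grafana instance is provisioned.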
+ +**Key new files**: `docker/telemetry/grafana/dashboards/rippled-validator-health.json` + +**Exit Criteria**: + +- [ ] All 13 panels render data during normal operation (flag panels such as Amendment Blocked are expected to read 0 on a healthy node) +- [ ] `$node` filter works correctly for multi-node deployments +- [ ] Amendment blocked and UNL expiry panels use color thresholds (red=blocked/expiring) + +--- + +## Task 9.12: Peer Quality Dashboard (External Dashboard Parity) + +> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) + +**Objective**: Create a Grafana dashboard for peer health aggregates. + +**Dashboard**: `rippled-peer-quality.json` + +| Panel | Type | PromQL | +| ---------------------- | ---------- | ---------------------------------------------------------------- | +| P90 Peer Latency | timeseries | `rippled_peer_quality{metric="peer_latency_p90_ms"}` | +| Insane/Diverged Peers | stat | `rippled_peer_quality{metric="peers_insane_count"}` | +| Higher Version Peers % | stat | `rippled_peer_quality{metric="peers_higher_version_pct"}` | +| Upgrade Recommended | stat | `rippled_peer_quality{metric="upgrade_recommended"}` | +| Resource Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects_Charges` | +| Inbound vs Outbound | bargauge | `rippled_Peer_Finder_Active_Inbound_Peers`, `..._Outbound_Peers` | + +**Key new files**: `docker/telemetry/grafana/dashboards/rippled-peer-quality.json` + +**Exit Criteria**: + +- [ ] All 6 panels render correctly +- [ ] P90 latency panel shows trend over time +- [ ] Upgrade recommended panel uses color threshold (red=1, green=0) + +--- + +## Task 9.13: Ledger Economy Dashboard Panels (External Dashboard Parity) + +> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) + +**Objective**: Add a "Ledger Economy" row to the existing `system-node-health.json` dashboard. 
+ +| Panel | Type | PromQL | +| -------------------- | ---------- | ----------------------------------------------------- | +| Base Fee (drops) | stat | `rippled_ledger_economy{metric="base_fee_xrp"}` | +| Reserve Base (drops) | stat | `rippled_ledger_economy{metric="reserve_base_xrp"}` | +| Reserve Inc (drops) | stat | `rippled_ledger_economy{metric="reserve_inc_xrp"}` | +| Ledger Age | stat | `rippled_ledger_economy{metric="ledger_age_seconds"}` | +| Transaction Rate | timeseries | `rippled_ledger_economy{metric="transaction_rate"}` | + +**Key modified files**: `docker/telemetry/grafana/dashboards/system-node-health.json` + +**Exit Criteria**: + +- [ ] 5 new panels render correctly in existing dashboard +- [ ] Fee values match `server_info` RPC output +- [ ] Transaction rate shows smooth trend (not spiky) + +--- + ## Exit Criteria - [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline @@ -352,3 +442,6 @@ These metrics serve multiple external consumer categories identified during rese - [ ] Integration test validates all new metric families are non-zero - [ ] No performance regression (< 0.5% CPU overhead from new callbacks) - [ ] Documentation updated with full new metric inventory +- [ ] Validator Health dashboard renders all 13 panels +- [ ] Peer Quality dashboard renders all 6 panels +- [ ] Ledger Economy panels added to system-node-health dashboard diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index 4a051cdd4e..18badfc2e7 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -257,7 +257,7 @@ These gauges are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s i ## Grafana Dashboards -Thirteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`: +Fifteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`: ### RPC Performance (`rippled-rpc-perf`) @@ -484,6 +484,147 @@ Add to your Prometheus alerting rules configuration. 
| `JobQueueBacklog` | Warning | `sum(rate(rippled_job_queued_total[5m])) - sum(rate(rippled_job_finished_total[5m])) > 100` | 5m | Jobs are being queued faster than they're completing | | `SlowJobExecution` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_job_running_duration_us_bucket[5m]))) > 10000000` | 5m | Job execution p95 exceeds 10 seconds | +## Validator Health Monitoring (Phase 7+) + +Phase 7 introduces native metrics for validator health, validation agreement, peer quality, ledger economy, and state tracking — inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). These metrics are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s interval). + +### Validation Agreement + +The `ValidationTracker` class computes rolling validation agreement between this node and network consensus. It maintains 1h and 24h sliding windows with an 8-second grace period and 5-minute late repair window. + +| Prometheus Metric | Description | +| ---------------------------------------------------------- | ------------------------------ | +| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Agreement % over last 1 hour | +| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Agreement % over last 24 hours | +| `rippled_validation_agreement{metric="agreements_1h"}` | Agreed validations in 1h | +| `rippled_validation_agreement{metric="missed_1h"}` | Missed validations in 1h | +| `rippled_validation_agreement{metric="agreements_24h"}` | Agreed validations in 24h | +| `rippled_validation_agreement{metric="missed_24h"}` | Missed validations in 24h | +| `rippled_validations_sent_total` | Total validations sent | +| `rippled_validations_checked_total` | Total network validations seen | +| `rippled_validation_agreements_total` | Cumulative agreements | +| `rippled_validation_missed_total` | Cumulative misses | + +**How reconciliation works**: + +1. 
When the node sends a validation for ledger X, the tracker records `weValidated=true` +2. When the network validates a ledger, the tracker records `networkValidated=true` +3. After an 8-second grace period, the tracker reconciles: if both are true for the same ledger hash, it's an agreement; otherwise, a miss +4. If a late validation arrives within 5 minutes, a previous miss can be corrected (late repair) + +**When to worry**: Agreement below 90% over 24h indicates the node is frequently missing or disagreeing with network-validated ledgers; check connectivity, clock sync, and whether the node is in `Full` mode. + +```promql +# Agreement percentage over 24 hours +rippled_validation_agreement{metric="agreement_pct_24h"} + +# Validations sent per minute (one per 3-5s ledger, so expect ~12-20 during normal operation) +rate(rippled_validations_sent_total[5m]) * 60 + +# Ratio of agreements to total reconciled (0-1, cumulative since the counters started) +rippled_validation_agreements_total / (rippled_validation_agreements_total + rippled_validation_missed_total) +``` + +### Validator Health Gauges + +| Prometheus Metric | Description | Healthy Value | +| ------------------------------------------------------ | ----------------------------------- | ----------------------- | +| `rippled_validator_health{metric="amendment_blocked"}` | 1 if amendment-blocked, 0 if not | 0 | +| `rippled_validator_health{metric="unl_blocked"}` | 1 if UNL-blocked, 0 if not | 0 | +| `rippled_validator_health{metric="unl_expiry_days"}` | Days until UNL list expires | > 14 | +| `rippled_validator_health{metric="validation_quorum"}` | Current validation quorum threshold | Network-dependent (~28) | + +```promql +# Alert if amendment blocked +rippled_validator_health{metric="amendment_blocked"} == 1 + +# Alert if UNL expiring within 14 days +rippled_validator_health{metric="unl_expiry_days"} < 14 +``` + +### Peer Quality Monitoring + +| Prometheus Metric | Description | +| --------------------------------------------------------- | --------------------------------------- | +|
`rippled_peer_quality{metric="peer_latency_p90_ms"}` | P90 peer latency in milliseconds | +| `rippled_peer_quality{metric="peers_insane_count"}` | Peers with diverged/insane tracking | +| `rippled_peer_quality{metric="peers_higher_version_pct"}` | % of peers running a newer version | +| `rippled_peer_quality{metric="upgrade_recommended"}` | 1 if >60% of peers are on newer version | +| `rippled_Overlay_Peer_Disconnects_Charges` | Disconnects due to resource charges | + +**Key insight**: If `upgrade_recommended` is 1, more than 60% of this node's peers report a newer rippled version than the node itself. This doesn't affect functionality immediately, but the node can become amendment-blocked if an amendment it does not support activates. + +```promql +# P90 peer latency trend +rippled_peer_quality{metric="peer_latency_p90_ms"} + +# Correlate high latency with disconnects +# (on (exported_instance) is needed because the two series carry different label sets) +rippled_peer_quality{metric="peer_latency_p90_ms"} > 500 + and on (exported_instance) rate(rippled_Overlay_Peer_Disconnects_Charges[5m]) > 0 +``` + +### Ledger Economy Monitoring + +| Prometheus Metric | Description | +| ----------------------------------------------------- | ---------------------------------- | +| `rippled_ledger_economy{metric="base_fee_xrp"}` | Base fee in drops | +| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Account reserve in drops | +| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Owner reserve increment in drops | +| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Seconds since last validated close | +| `rippled_ledger_economy{metric="transaction_rate"}` | Smoothed transaction rate | +| `rippled_ledgers_closed_total` | Total ledgers closed | + +```promql +# Fee values (should match server_info output) +rippled_ledger_economy{metric="base_fee_xrp"} + +# Ledger age (should reset to ~0 every 3-5s) +rippled_ledger_economy{metric="ledger_age_seconds"} + +# Ledger close rate (should be ~12-20 per minute) +rate(rippled_ledgers_closed_total[5m]) * 60 +``` + +### State Tracking + +| Prometheus Metric | Description | +|
---------------------------------------------------------------- | ------------------------------ | +| `rippled_state_tracking{metric="state_value"}` | Numeric state (0-6, see table) | +| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Duration in current state | +| `rippled_state_changes_total` | Total state transitions | + +**State value encoding**: + +| Value | State | Meaning | +| ----- | ------------ | ---------------------------------------------------- | +| 0 | disconnected | No network connectivity | +| 1 | connected | Connected but not syncing | +| 2 | syncing | Fetching ledger history | +| 3 | tracking | Following network but not fully validated | +| 4 | full | Fully synced, not validating | +| 5 | validating | Fully synced and validating | +| 6 | proposing | Fully synced, validating, and proposing in consensus | + +Values 5-6 combine `OperatingMode` (0-4) with `ConsensusMode` (validating/proposing) to give a richer picture of node participation. + +```promql +# State timeline (should stay at 5 or 6 for validators) +rippled_state_tracking{metric="state_value"} + +# Alert on frequent state changes (flapping): more than 2 transitions in the last hour +# (increase() counts transitions over the window; rate() would be per second) +increase(rippled_state_changes_total[1h]) > 2 +``` + +### Grafana Dashboards (Phase 9) + +| Dashboard | UID | Panels | Key Metrics | +| ------------------ | -------------------------- | ------ | --------------------------------------------------------- | +| Validator Health | `rippled-validator-health` | 13 | Agreement %, validation rate, amendment/UNL health, state | +| Peer Quality | `rippled-peer-quality` | 6 | P90 latency, insane peers, version awareness | +| System Node Health | (updated) | +5 | Ledger economy row: fee, reserves, age, tx rate | + +--- + ## Troubleshooting ### No OTel SDK metrics in Prometheus