mirror of
https://github.com/XRPLF/rippled.git
synced 2026-06-02 16:26:48 +00:00
fix: address CI rename checks (rippled -> xrpld) in phase-10 docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -846,7 +846,7 @@ flowchart LR

 ### Key Implementation Details

-- **Transaction submitter and RPC load generator** both use rippled's native WebSocket command format (`{"command": ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
+- **Transaction submitter and RPC load generator** both use xrpld's native WebSocket command format (`{"command": ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
 - **Node config** requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics.
 - **Metric validation** uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series.
 - **StatsD gauge fix**: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus.
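The StatsD gauge fix described in the last bullet can be illustrated with a small model. The sketch below is a simplified Python stand-in, not the actual C++ `StatsDGaugeImpl`: the class `Gauge`, its `flush` method, and the `initially_dirty` switch are hypothetical names chosen for illustration. It shows why a gauge that starts at 0 and never changes emits nothing unless the dirty flag starts out true.

```python
class Gauge:
    """Toy model of a dirty-flag gauge (hypothetical, not xrpld code)."""

    def __init__(self, name, initially_dirty):
        self.name = name
        self.value = 0
        self.dirty = initially_dirty

    def set(self, value):
        # Only a real change marks the gauge as needing a flush.
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self):
        """Return the StatsD gauge line to send, or None if nothing is pending."""
        if not self.dirty:
            return None
        self.dirty = False
        return f"{self.name}:{self.value}|g"


# A gauge that is set to 0 and never changes, with and without the fix.
before_fix = Gauge("jobq_job_count", initially_dirty=False)
after_fix = Gauge("jobq_job_count", initially_dirty=True)
before_fix.set(0)
after_fix.set(0)

before_first_flush = before_fix.flush()
after_first_flush = after_fix.flush()
print(before_first_flush)
print(after_first_flush)
```

In this model the gauge without the fix never produces a line, so a downstream series for it would never be created; with the flag initialized to true, the first flush sends the initial value once and later flushes stay silent until the value changes.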
@@ -871,13 +871,13 @@ See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown

 The validation suite (`validate_telemetry.py`) runs exactly 71 checks, broken down as:

-- **1 service registration** — `rippled` exists in Tempo
+- **1 service registration** — `xrpld` exists in Tempo
 - **17 span existence** — `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive`
 - **14 span attribute** — required attributes on the 14 spans that define them (22 unique attributes total)
 - **2 span hierarchies** — `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread)
 - **1 span duration bounds** — all spans > 0 and < 60 s
 - **26 metric existence** — 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`)
-- **10 dashboard loads** — `rippled-rpc-perf`, `rippled-transactions`, `rippled-consensus`, `rippled-ledger-ops`, `rippled-peer-net`, `rippled-system-node-health`, `rippled-system-network`, `rippled-system-rpc`, `rippled-system-overlay-detail`, `rippled-system-ledger-sync`
+- **10 dashboard loads** — `xrpld-rpc-perf`, `xrpld-transactions`, `xrpld-consensus`, `xrpld-ledger-ops`, `xrpld-peer-net`, `xrpld-system-node-health`, `xrpld-system-network`, `xrpld-system-rpc`, `xrpld-system-overlay-detail`, `xrpld-system-ledger-sync`

 See [Phase10_taskList.md](./Phase10_taskList.md) for the full numbered check-by-check enumeration.

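The breakdown in the hunk above can be cross-checked with a few lines of arithmetic. This is only a tally of the numbers quoted in the list; the dictionary names are made up for the example and do not correspond to anything in `validate_telemetry.py`.

```python
# Check counts per category, copied from the list above.
checks = {
    "service registration": 1,
    "span existence": 17,
    "span attribute": 14,
    "span hierarchies": 2,
    "span duration bounds": 1,
    "metric existence": 26,
    "dashboard loads": 10,
}

# Sub-groups of the 26 metric existence checks.
metric_groups = {
    "SpanMetrics": 4,
    "StatsD gauges": 6,
    "StatsD counters": 2,
    "StatsD histograms": 3,
    "overlay traffic": 4,
    "Phase 9 OTLP": 7,
}

total_checks = sum(checks.values())
total_metrics = sum(metric_groups.values())
print(total_checks, total_metrics)
```

Both sums agree with the text: the categories add up to 71 checks, and the metric sub-groups add up to the 26 metric existence checks.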
@@ -446,40 +446,40 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h
 > **Upstream**: Phase 7 Tasks 7.9-7.16 (metrics), Phase 9 Tasks 9.11-9.13 (dashboards).
 > **Downstream**: None — terminal task in the parity chain.

-**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `rippled_*` internal metrics.
+**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `xrpld_*` internal metrics.

 **Critical Group** (8 rules, eval interval 10s):

-| Rule | Condition | For |
-| ------------------- | --------------------------------------------------------------- | --- |
-| Agreement Below 90% | `rippled_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s |
-| Not Proposing | `rippled_state_tracking{metric="state_value"} < 6` | 10s |
-| Unhealthy State | `rippled_state_tracking{metric="state_value"} < 4` | 10s |
-| Amendment Blocked | `rippled_validator_health{metric="amendment_blocked"} == 1` | 1m |
-| UNL Expiring | `rippled_validator_health{metric="unl_expiry_days"} < 14` | 1h |
-| High IO Latency | `histogram_quantile(0.95, rippled_ios_latency_bucket) > 50` | 1m |
-| High Load Factor | `rippled_load_factor_metrics{metric="load_factor"} > 1000` | 1m |
-| Peer Count Critical | `rippled_server_info{metric="peers"} < 5` | 1m |
+| Rule | Condition | For |
+| ------------------- | ------------------------------------------------------------- | --- |
+| Agreement Below 90% | `xrpld_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s |
+| Not Proposing | `xrpld_state_tracking{metric="state_value"} < 6` | 10s |
+| Unhealthy State | `xrpld_state_tracking{metric="state_value"} < 4` | 10s |
+| Amendment Blocked | `xrpld_validator_health{metric="amendment_blocked"} == 1` | 1m |
+| UNL Expiring | `xrpld_validator_health{metric="unl_expiry_days"} < 14` | 1h |
+| High IO Latency | `histogram_quantile(0.95, xrpld_ios_latency_bucket) > 50` | 1m |
+| High Load Factor | `xrpld_load_factor_metrics{metric="load_factor"} > 1000` | 1m |
+| Peer Count Critical | `xrpld_server_info{metric="peers"} < 5` | 1m |

 **Network Group** (3 rules, eval interval 10s):

-| Rule | Condition | For |
-| ------------------------- | ------------------------------------------------------------------- | --- |
-| Peer Drop >10% | `delta(rippled_server_info{metric="peers"}[30s]) / ... * 100 < -10` | 30s |
-| Peer Drop >30% | Same formula, threshold -30 | 30s |
-| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m |
+| Rule | Condition | For |
+| ------------------------- | ----------------------------------------------------------------- | --- |
+| Peer Drop >10% | `delta(xrpld_server_info{metric="peers"}[30s]) / ... * 100 < -10` | 30s |
+| Peer Drop >30% | Same formula, threshold -30 | 30s |
+| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m |

 **Performance Group** (7 rules, eval interval 10s):

-| Rule | Condition | For |
-| ------------------- | -------------------------------------------------------------- | --- |
-| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m |
-| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m |
-| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m |
-| Job Queue Overflow | `rate(rippled_jq_trans_overflow_total[5m]) > 0` | 1m |
-| Upgrade Recommended | `rippled_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m |
-| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m |
-| Stale Ledger | `rippled_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m |
+| Rule | Condition | For |
+| ------------------- | ------------------------------------------------------------ | --- |
+| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m |
+| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m |
+| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m |
+| Job Queue Overflow | `rate(xrpld_jq_trans_overflow_total[5m]) > 0` | 1m |
+| Upgrade Recommended | `xrpld_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m |
+| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m |
+| Stale Ledger | `xrpld_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m |

 **Notification channel templates**: Email/SMTP, Discord, Slack, PagerDuty.

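The interaction between the 10s evaluation interval and the `For` column in the tables above can be sketched as follows. This is a simplified model of Prometheus-style pending/firing behaviour, assuming a rule fires once its condition has held continuously for at least the `For` duration; the function name `first_firing_time` and the sample values are invented for illustration, and Grafana's actual evaluation may differ in detail.

```python
def first_firing_time(samples, condition, for_seconds):
    """Return the evaluation timestamp at which a rule starts firing.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    condition: predicate on the value, e.g. lambda v: v < 90.
    for_seconds: how long the condition must hold before firing.
    """
    pending_since = None
    for timestamp, value in samples:
        if condition(value):
            if pending_since is None:
                pending_since = timestamp  # rule enters the pending state
            if timestamp - pending_since >= for_seconds:
                return timestamp  # held long enough: firing
        else:
            pending_since = None  # condition cleared, pending timer resets
    return None


# "Agreement Below 90%" with For = 30s, evaluated every 10s (made-up values).
agreement = [(0, 95), (10, 89), (20, 88), (30, 91),
             (40, 85), (50, 85), (60, 85), (70, 85)]
agreement_fires_at = first_firing_time(agreement, lambda v: v < 90, 30)

# "Not Proposing" with For = 10s, evaluated every 10s (made-up values).
state = [(0, 6), (10, 5), (20, 5), (30, 6)]
state_fires_at = first_firing_time(state, lambda v: v < 6, 10)
print(agreement_fires_at, state_fires_at)
```

In the first series the brief recovery at t=30 resets the pending timer, so the rule only fires at t=70. Under this model, a `For` value equal to the evaluation interval means the condition must be seen on two consecutive evaluations before the alert fires.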
@@ -507,13 +507,13 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h

 **Use case**: Real-time state panels (server state, ledger age, peer count) where 10-15s latency is too slow for operational dashboards.

-**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to rippled would only be needed if sub-second metric freshness is required from the internal metrics pipeline.
+**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to xrpld would only be needed if sub-second metric freshness is required from the internal metrics pipeline.

 **What to document**:

 - Architecture comparison: OTLP pipeline (10-15s) vs. direct scrape (2-5s) vs. push gateway
 - When to consider: operator feedback indicating 10s is insufficient for alerting SLOs
-- How to implement if needed: add `/metrics` HTTP endpoint to rippled with Prometheus client library
+- How to implement if needed: add `/metrics` HTTP endpoint to xrpld with Prometheus client library
 - Trade-offs: additional port, additional dependency, duplication with OTLP metrics

 **Key files**:
@@ -127,10 +127,10 @@ These metrics serve multiple external consumer categories identified during rese
 **What to do**:

 - Register OTel instruments for PerfLog RPC counters (from `PerfLogImp.cpp` line ~63):
-- Counter: `rippled_rpc_method_started_total{method="<name>"}` — calls started
-- Counter: `rippled_rpc_method_finished_total{method="<name>"}` — calls completed
-- Counter: `rippled_rpc_method_errored_total{method="<name>"}` — calls errored
-- Histogram: `rippled_rpc_method_duration_us{method="<name>"}` — execution time distribution
+- Counter: `xrpld_rpc_method_started_total{method="<name>"}` — calls started
+- Counter: `xrpld_rpc_method_finished_total{method="<name>"}` — calls completed
+- Counter: `xrpld_rpc_method_errored_total{method="<name>"}` — calls errored
+- Histogram: `xrpld_rpc_method_duration_us{method="<name>"}` — execution time distribution

 - Use OTel `Counter<int64_t>` and `Histogram<double>` instruments with `method` attribute label.

@@ -154,11 +154,11 @@ These metrics serve multiple external consumer categories identified during rese
 **What to do**:

 - Register OTel instruments for PerfLog job counters:
-- Counter: `rippled_job_queued_total{job_type="<name>"}` — jobs queued
-- Counter: `rippled_job_started_total{job_type="<name>"}` — jobs started
-- Counter: `rippled_job_finished_total{job_type="<name>"}` — jobs completed
-- Histogram: `rippled_job_queued_duration_us{job_type="<name>"}` — time spent waiting in queue
-- Histogram: `rippled_job_running_duration_us{job_type="<name>"}` — execution time distribution
+- Counter: `xrpld_job_queued_total{job_type="<name>"}` — jobs queued
+- Counter: `xrpld_job_started_total{job_type="<name>"}` — jobs started
+- Counter: `xrpld_job_finished_total{job_type="<name>"}` — jobs completed
+- Histogram: `xrpld_job_queued_duration_us{job_type="<name>"}` — time spent waiting in queue
+- Histogram: `xrpld_job_running_duration_us{job_type="<name>"}` — execution time distribution

 - Hook into PerfLog's existing job tracking alongside Task 9.4.

@@ -180,15 +180,15 @@ These metrics serve multiple external consumer categories identified during rese
 **What to do**:

 - Register OTel `ObservableGauge` callbacks for `CountedObject<T>` instance counts:
-- `rippled_object_count{type="Transaction"}` — live Transaction objects
-- `rippled_object_count{type="Ledger"}` — live Ledger objects
-- `rippled_object_count{type="NodeObject"}` — live NodeObject instances
-- `rippled_object_count{type="STTx"}` — serialized transaction objects
-- `rippled_object_count{type="STLedgerEntry"}` — serialized ledger entries
-- `rippled_object_count{type="InboundLedger"}` — ledgers being fetched
-- `rippled_object_count{type="Pathfinder"}` — active pathfinding computations
-- `rippled_object_count{type="PathRequest"}` — active path requests
-- `rippled_object_count{type="HashRouterEntry"}` — hash router entries
+- `xrpld_object_count{type="Transaction"}` — live Transaction objects
+- `xrpld_object_count{type="Ledger"}` — live Ledger objects
+- `xrpld_object_count{type="NodeObject"}` — live NodeObject instances
+- `xrpld_object_count{type="STTx"}` — serialized transaction objects
+- `xrpld_object_count{type="STLedgerEntry"}` — serialized ledger entries
+- `xrpld_object_count{type="InboundLedger"}` — ledgers being fetched
+- `xrpld_object_count{type="Pathfinder"}` — active pathfinding computations
+- `xrpld_object_count{type="PathRequest"}` — active path requests
+- `xrpld_object_count{type="HashRouterEntry"}` — hash router entries

 - The `CountedObject` template already tracks these via atomic counters. The callback just reads the current counts.

@@ -1,11 +1,11 @@
 # Telemetry Workload Tools

-Synthetic workload generation and validation tools for rippled's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.
+Synthetic workload generation and validation tools for xrpld's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.

 ## Quick Start

 ```bash
-# Build rippled with telemetry enabled
+# Build xrpld with telemetry enabled
 conan install . --build=missing -o telemetry=True
 cmake --preset default -Dtelemetry=ON
 cmake --build --preset default
@@ -19,7 +19,7 @@ docker/telemetry/workload/run-full-validation.sh --cleanup

 ## Architecture

-The validation suite runs a multi-node rippled cluster as local processes alongside
+The validation suite runs a multi-node xrpld cluster as local processes alongside
 a Docker Compose telemetry stack. The cluster exercises consensus, peer-to-peer
 spans (proposals, validations), and all metric pipelines.

@@ -108,7 +108,7 @@ Custom `"weights"` override the default command/transaction distribution.

 ### run-full-validation.sh

-Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node rippled cluster, generates load, and validates the results.
+Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node xrpld cluster, generates load, and validates the results.

 ```bash
 # Full validation with defaults (uses full-validation profile)
@@ -146,7 +146,7 @@ python3 workload_orchestrator.py --profile stress --report /tmp/report.json
 ### rpc_load_generator.py

 Generates RPC traffic matching realistic production distribution. Uses
-rippled's **native WebSocket command format** (`{"command": ...}`) with flat
+xrpld's **native WebSocket command format** (`{"command": ...}`) with flat
 parameters — the same format as `tx_submitter.py`.

 - 40% health checks (server_info, fee)
@@ -172,7 +172,7 @@ python3 rpc_load_generator.py --endpoints ws://localhost:6006 \
 ### tx_submitter.py

 Submits diverse transaction types to exercise the full span and metric surface.
-Uses rippled's **native WebSocket command format** (`{"command": ...}`) rather
+Uses xrpld's **native WebSocket command format** (`{"command": ...}`) rather
 than JSON-RPC format. The response payload is inside the `"result"` key, with
 `"status"` at the top level.

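The difference between the two request formats mentioned above is easiest to see side by side. The following is a minimal sketch, not code from `tx_submitter.py`: the helper names, the placeholder account value, and the sample response string are all made up for illustration. It builds a WebSocket-style request with flat parameters, a JSON-RPC-style request for contrast, and unwraps a WebSocket response whose `"status"` sits at the top level next to `"result"`.

```python
import json


def ws_request(command, request_id, **params):
    """Native WebSocket format: parameters sit flat next to "command"."""
    return {"id": request_id, "command": command, **params}


def jsonrpc_request(method, **params):
    """JSON-RPC format, shown only for contrast: parameters nested in "params"."""
    return {"method": method, "params": [params]}


def ws_result(raw_response):
    """Return the payload of a WebSocket response, or raise on failure."""
    response = json.loads(raw_response)
    # "status" is a top-level key in the WebSocket format.
    if response.get("status") != "success":
        raise RuntimeError(f"request failed: {response.get('error')}")
    return response["result"]


# "rEXAMPLE" is a placeholder, not a real account address.
ws_msg = ws_request("account_info", 1, account="rEXAMPLE")
rpc_msg = jsonrpc_request("account_info", account="rEXAMPLE")

# Hand-written sample response in the WebSocket shape.
sample = '{"id": 1, "status": "success", "type": "response", "result": {"ledger_index": 7}}'
payload = ws_result(sample)
print(json.dumps(ws_msg))
print(json.dumps(rpc_msg))
print(payload)
```

A client written against the JSON-RPC shape would look for `"status"` inside `"result"` and miss it here, which is one reason both tools stick to a single format.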
@@ -310,7 +310,7 @@ Categories:
 The validation runs as a GitHub Actions workflow (`.github/workflows/telemetry-validation.yml`):

 - Triggered manually or on pushes to telemetry branches
-- Builds rippled, starts the full stack, runs load, validates
+- Builds xrpld, starts the full stack, runs load, validates
 - Uploads reports as artifacts
 - Posts summary to PR

@@ -16,11 +16,11 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191
 **Symptoms:**

 ```
-[FAIL] metric.statsd_gauges.rippled_LedgerMaster_Validated_Ledger_Age: 0 series
-[FAIL] metric.statsd_counters.rippled_rpc_requests: 0 series
-[FAIL] metric.statsd_histograms.rippled_rpc_time: 0 series
-[FAIL] metric.overlay_traffic.rippled_total_Bytes_In: 0 series
-[FAIL] metric.phase9_nodestore.rippled_nodestore_reads_total: 0 series
+[FAIL] metric.statsd_gauges.xrpld_LedgerMaster_Validated_Ledger_Age: 0 series
+[FAIL] metric.statsd_counters.xrpld_rpc_requests: 0 series
+[FAIL] metric.statsd_histograms.xrpld_rpc_time: 0 series
+[FAIL] metric.overlay_traffic.xrpld_total_Bytes_In: 0 series
+[FAIL] metric.phase9_nodestore.xrpld_nodestore_reads_total: 0 series
 ... (25 total)
 ```

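The "0 series" failures above come from a metric existence check against Prometheus. The sketch below shows roughly what such a check involves, assuming the standard Prometheus `/api/v1/series` endpoint with a `match[]` selector; the helper names, the `http://localhost:9090` base URL, and the sample response bodies are assumptions for illustration and are not taken from `validate_telemetry.py`.

```python
import json
from urllib.parse import urlencode


def series_url(base_url, metric_name):
    """Build a series query URL for one metric name."""
    query = urlencode({"match[]": metric_name})
    return f"{base_url}/api/v1/series?{query}"


def count_series(response_body):
    """Count the series returned in a /api/v1/series response body."""
    body = json.loads(response_body)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body.get('error')}")
    return len(body["data"])


url = series_url("http://localhost:9090", "xrpld_rpc_time")

# Hand-written response bodies in the shape Prometheus returns.
present = '{"status": "success", "data": [{"__name__": "xrpld_rpc_time", "job": "demo"}]}'
missing = '{"status": "success", "data": []}'
present_count = count_series(present)
missing_count = count_series(missing)
print(url)
print(present_count, missing_count)
```

A metric passes when the count is greater than zero. An empty `data` list corresponds to the `0 series` lines in the symptoms, which is what a name or pipeline mismatch would produce.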
@@ -32,7 +32,7 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191
 the validation harness configures xrpld nodes with `server=statsd`.

 2. **Metric name mismatch:** The `expected_metrics.json` expects StatsD-style metric
-names (e.g., `rippled_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`,
+names (e.g., `xrpld_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`,
 beast::insight emits OTLP metrics which may have different names/structure.

 **Fix Options (pick one):**
@@ -127,24 +127,24 @@ individual checks), but the parent-child relationship isn't established.
 **Symptoms:**

 ```
-[FAIL] dashboard.rippled-statsd-node-health: HTTP 404
-[FAIL] dashboard.rippled-statsd-network: HTTP 404
-[FAIL] dashboard.rippled-statsd-rpc: HTTP 404
-[FAIL] dashboard.rippled-statsd-overlay-detail: HTTP 404
-[FAIL] dashboard.rippled-statsd-ledger-sync: HTTP 404
+[FAIL] dashboard.xrpld-statsd-node-health: HTTP 404
+[FAIL] dashboard.xrpld-statsd-network: HTTP 404
+[FAIL] dashboard.xrpld-statsd-rpc: HTTP 404
+[FAIL] dashboard.xrpld-statsd-overlay-detail: HTTP 404
+[FAIL] dashboard.xrpld-statsd-ledger-sync: HTTP 404
 ```

-**Root Cause:** Dashboard UIDs were renamed from `rippled-statsd-*` to `rippled-system-*`
+**Root Cause:** Dashboard UIDs were renamed from `xrpld-statsd-*` to `xrpld-system-*`
 but `expected_metrics.json` still references the old names.

 **Actual UIDs in `docker/telemetry/grafana/dashboards/`:**
 | Expected (in expected_metrics.json) | Actual (in dashboard JSON) |
 |-------------------------------------|-------------------------------|
-| `rippled-statsd-node-health` | `rippled-system-node-health` |
-| `rippled-statsd-network` | `rippled-system-network` |
-| `rippled-statsd-rpc` | `rippled-system-rpc` |
-| `rippled-statsd-overlay-detail` | `rippled-system-overlay-detail` |
-| `rippled-statsd-ledger-sync` | `rippled-system-ledger-sync` |
+| `xrpld-statsd-node-health` | `xrpld-system-node-health` |
+| `xrpld-statsd-network` | `xrpld-system-network` |
+| `xrpld-statsd-rpc` | `xrpld-system-rpc` |
+| `xrpld-statsd-overlay-detail` | `xrpld-system-overlay-detail` |
+| `xrpld-statsd-ledger-sync` | `xrpld-system-ledger-sync` |

 **Fix:** Update the 5 UIDs in `expected_metrics.json` → `grafana_dashboards.uids[]`.

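The fix described above amounts to a mechanical rename of five list entries. A minimal sketch follows, assuming `expected_metrics.json` holds the UIDs as a list under `grafana_dashboards.uids` as the path notation in the Fix line suggests; the function name and the in-memory sample config are hypothetical, and the real file contains other keys that are not modelled here.

```python
import json


def rename_dashboard_uids(config):
    """Return a copy of config with `-statsd-` dashboard UIDs renamed to `-system-`."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    dashboards = updated["grafana_dashboards"]
    dashboards["uids"] = [
        uid.replace("-statsd-", "-system-", 1) for uid in dashboards["uids"]
    ]
    return updated


# In-memory stand-in for the relevant part of expected_metrics.json.
expected = {
    "grafana_dashboards": {
        "uids": [
            "xrpld-rpc-perf",
            "xrpld-statsd-node-health",
            "xrpld-statsd-network",
            "xrpld-statsd-rpc",
            "xrpld-statsd-overlay-detail",
            "xrpld-statsd-ledger-sync",
        ]
    }
}

fixed = rename_dashboard_uids(expected)
print(json.dumps(fixed["grafana_dashboards"]["uids"], indent=2))
```

UIDs that do not contain `-statsd-`, such as `xrpld-rpc-perf`, pass through unchanged, so running the rename against an already-fixed file has no effect.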