fix: address CI rename checks (rippled -> xrpld) in phase-10 docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pratik Mankawde
2026-04-29 20:40:44 +01:00
parent 70d86d7ebf
commit b659d43395
5 changed files with 72 additions and 72 deletions

@@ -846,7 +846,7 @@ flowchart LR
### Key Implementation Details
- **Transaction submitter and RPC load generator** both use rippled's native WebSocket command format (`{"command": ...}`) not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
- **Transaction submitter and RPC load generator** both use xrpld's native WebSocket command format (`{"command": ...}`) not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
- **Node config** requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics.
- **Metric validation** uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series.
- **StatsD gauge fix**: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus.
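
The series-based metric check described above can be sketched as follows. This is a minimal illustration, not the code in `validate_telemetry.py`: the helper names `series_url` and `count_series`, the base URL, and the sample payload are assumptions made for the example. Only the `/api/v1/series` endpoint and its `match[]` parameter come from the Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode


def series_url(base_url: str, metric: str) -> str:
    """Build a /api/v1/series query URL that matches one metric name."""
    query = urlencode({"match[]": metric})
    return f"{base_url.rstrip('/')}/api/v1/series?{query}"


def count_series(payload: dict) -> int:
    """Return the number of series in a Prometheus series response."""
    if payload.get("status") != "success":
        return 0
    return len(payload.get("data", []))


# Hypothetical response body, shaped like a Prometheus series reply.
sample = json.loads(
    '{"status": "success", "data": [{"__name__": "jobq_job_count"}]}'
)

# Assumed local Prometheus address; adjust to the actual stack.
url = series_url("http://localhost:9090", "jobq_job_count")

# A metric passes when at least one series exists, even if its value is 0.
passed = count_series(sample) > 0
```

Because the check counts series rather than reading a current value, a gauge that has emitted only its initial 0 still counts as present.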
@@ -871,13 +871,13 @@ See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown
The validation suite (`validate_telemetry.py`) runs exactly 71 checks, broken down as:
- **1 service registration** — `rippled` exists in Tempo
- **1 service registration** — `xrpld` exists in Tempo
- **17 span existence** — `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive`
- **14 span attribute** — required attributes on the 14 spans that define them (22 unique attributes total)
- **2 span hierarchies** — `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread)
- **1 span duration bounds** — all spans > 0 and < 60 s
- **26 metric existence** 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`)
- **10 dashboard loads** `rippled-rpc-perf`, `rippled-transactions`, `rippled-consensus`, `rippled-ledger-ops`, `rippled-peer-net`, `rippled-system-node-health`, `rippled-system-network`, `rippled-system-rpc`, `rippled-system-overlay-detail`, `rippled-system-ledger-sync`
- **10 dashboard loads** `xrpld-rpc-perf`, `xrpld-transactions`, `xrpld-consensus`, `xrpld-ledger-ops`, `xrpld-peer-net`, `xrpld-system-node-health`, `xrpld-system-network`, `xrpld-system-rpc`, `xrpld-system-overlay-detail`, `xrpld-system-ledger-sync`
See [Phase10_taskList.md](./Phase10_taskList.md) for the full numbered check-by-check enumeration.
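
As a quick cross-check of the breakdown above, the category counts can be summed to confirm they reach 71 checks and 26 metrics. The dictionary names below are illustrative only and do not come from `validate_telemetry.py`.

```python
# Check categories and counts, copied from the breakdown above.
CHECK_COUNTS = {
    "service registration": 1,
    "span existence": 17,
    "span attribute": 14,
    "span hierarchies": 2,
    "span duration bounds": 1,
    "metric existence": 26,
    "dashboard loads": 10,
}

# Sub-groups of the metric existence category.
METRIC_COUNTS = {
    "SpanMetrics": 4,
    "StatsD gauges": 6,
    "StatsD counters": 2,
    "StatsD histograms": 3,
    "overlay traffic": 4,
    "Phase 9 OTLP": 7,
}

total_checks = sum(CHECK_COUNTS.values())
total_metrics = sum(METRIC_COUNTS.values())
```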

@@ -446,40 +446,40 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h
> **Upstream**: Phase 7 Tasks 7.9-7.16 (metrics), Phase 9 Tasks 9.11-9.13 (dashboards).
> **Downstream**: None — terminal task in the parity chain.
**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `rippled_*` internal metrics.
**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `xrpld_*` internal metrics.
**Critical Group** (8 rules, eval interval 10s):
| Rule | Condition | For |
| ------------------- | --------------------------------------------------------------- | --- |
| Agreement Below 90% | `rippled_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s |
| Not Proposing | `rippled_state_tracking{metric="state_value"} < 6` | 10s |
| Unhealthy State | `rippled_state_tracking{metric="state_value"} < 4` | 10s |
| Amendment Blocked | `rippled_validator_health{metric="amendment_blocked"} == 1` | 1m |
| UNL Expiring | `rippled_validator_health{metric="unl_expiry_days"} < 14` | 1h |
| High IO Latency | `histogram_quantile(0.95, rippled_ios_latency_bucket) > 50` | 1m |
| High Load Factor | `rippled_load_factor_metrics{metric="load_factor"} > 1000` | 1m |
| Peer Count Critical | `rippled_server_info{metric="peers"} < 5` | 1m |
| Rule | Condition | For |
| ------------------- | ------------------------------------------------------------- | --- |
| Agreement Below 90% | `xrpld_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s |
| Not Proposing | `xrpld_state_tracking{metric="state_value"} < 6` | 10s |
| Unhealthy State | `xrpld_state_tracking{metric="state_value"} < 4` | 10s |
| Amendment Blocked | `xrpld_validator_health{metric="amendment_blocked"} == 1` | 1m |
| UNL Expiring | `xrpld_validator_health{metric="unl_expiry_days"} < 14` | 1h |
| High IO Latency | `histogram_quantile(0.95, xrpld_ios_latency_bucket) > 50` | 1m |
| High Load Factor | `xrpld_load_factor_metrics{metric="load_factor"} > 1000` | 1m |
| Peer Count Critical | `xrpld_server_info{metric="peers"} < 5` | 1m |
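
The interaction between a rule's condition and its "For" column can be sketched as below. This is a simplified model under stated assumptions, not Grafana's evaluation engine: the `Rule` class and `first_firing_time` helper are hypothetical, samples are assumed to arrive once per 10s evaluation interval, and a rule fires only after its condition has held continuously for the "For" duration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    threshold: float  # rule fires when value < threshold
    for_seconds: int  # how long the condition must hold


def first_firing_time(
    rule: Rule, samples: Iterable[Tuple[int, float]]
) -> Optional[int]:
    """Return the timestamp at which the rule first fires, or None."""
    pending_since: Optional[int] = None
    for timestamp, value in samples:
        if value < rule.threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= rule.for_seconds:
                return timestamp
        else:
            pending_since = None  # condition cleared, reset the timer
    return None


# Thresholds and durations taken from the Critical Group table.
not_proposing = Rule("Not Proposing", threshold=6, for_seconds=10)
unhealthy = Rule("Unhealthy State", threshold=4, for_seconds=10)

# Hypothetical state_value samples as (seconds, value) pairs.
samples = [(0, 7), (10, 5), (20, 5), (30, 5)]

not_proposing_fired_at = first_firing_time(not_proposing, samples)
unhealthy_fired_at = first_firing_time(unhealthy, samples)
```

In this sample the state value drops to 5, so only the "Not Proposing" rule fires; the "Unhealthy State" threshold of 4 is never crossed.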
**Network Group** (3 rules, eval interval 10s):
| Rule | Condition | For |
| ------------------------- | ------------------------------------------------------------------- | --- |
| Peer Drop >10% | `delta(rippled_server_info{metric="peers"}[30s]) / ... * 100 < -10` | 30s |
| Peer Drop >30% | Same formula, threshold -30 | 30s |
| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m |
| Rule | Condition | For |
| ------------------------- | ----------------------------------------------------------------- | --- |
| Peer Drop >10% | `delta(xrpld_server_info{metric="peers"}[30s]) / ... * 100 < -10` | 30s |
| Peer Drop >30% | Same formula, threshold -30 | 30s |
| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m |
**Performance Group** (7 rules, eval interval 10s):
| Rule | Condition | For |
| ------------------- | -------------------------------------------------------------- | --- |
| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m |
| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m |
| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m |
| Job Queue Overflow | `rate(rippled_jq_trans_overflow_total[5m]) > 0` | 1m |
| Upgrade Recommended | `rippled_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m |
| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m |
| Stale Ledger | `rippled_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m |
| Rule | Condition | For |
| ------------------- | ------------------------------------------------------------ | --- |
| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m |
| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m |
| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m |
| Job Queue Overflow | `rate(xrpld_jq_trans_overflow_total[5m]) > 0` | 1m |
| Upgrade Recommended | `xrpld_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m |
| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m |
| Stale Ledger | `xrpld_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m |
**Notification channel templates**: Email/SMTP, Discord, Slack, PagerDuty.
@@ -507,13 +507,13 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h
**Use case**: Real-time state panels (server state, ledger age, peer count) where 10-15s latency is too slow for operational dashboards.
**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to rippled would only be needed if sub-second metric freshness is required from the internal metrics pipeline.
**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to xrpld would only be needed if sub-second metric freshness is required from the internal metrics pipeline.
**What to document**:
- Architecture comparison: OTLP pipeline (10-15s) vs. direct scrape (2-5s) vs. push gateway
- When to consider: operator feedback indicating 10s is insufficient for alerting SLOs
- How to implement if needed: add `/metrics` HTTP endpoint to rippled with Prometheus client library
- How to implement if needed: add `/metrics` HTTP endpoint to xrpld with Prometheus client library
- Trade-offs: additional port, additional dependency, duplication with OTLP metrics
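
For the direct-scrape option, the payload a `/metrics` endpoint would return can be illustrated with the Prometheus text exposition format. The sketch below is in Python purely for illustration and says nothing about how xrpld would implement it; `render_gauge` is a hypothetical helper, the help text and sample value are made up, and label-value escaping is omitted.

```python
from typing import Dict, Optional


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Render one gauge sample in Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


# Hypothetical sample using a metric name from the alert tables.
body = render_gauge(
    "xrpld_server_info", "Server info values", 12, {"metric": "peers"}
)
```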
**Key files**:

@@ -127,10 +127,10 @@ These metrics serve multiple external consumer categories identified during rese
**What to do**:
- Register OTel instruments for PerfLog RPC counters (from `PerfLogImp.cpp` line ~63):
- Counter: `rippled_rpc_method_started_total{method="<name>"}` — calls started
- Counter: `rippled_rpc_method_finished_total{method="<name>"}` — calls completed
- Counter: `rippled_rpc_method_errored_total{method="<name>"}` — calls errored
- Histogram: `rippled_rpc_method_duration_us{method="<name>"}` — execution time distribution
- Counter: `xrpld_rpc_method_started_total{method="<name>"}` — calls started
- Counter: `xrpld_rpc_method_finished_total{method="<name>"}` — calls completed
- Counter: `xrpld_rpc_method_errored_total{method="<name>"}` — calls errored
- Histogram: `xrpld_rpc_method_duration_us{method="<name>"}` — execution time distribution
- Use OTel `Counter<int64_t>` and `Histogram<double>` instruments with `method` attribute label.
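
The bookkeeping these instruments would record can be modelled with a small standard-library sketch. It deliberately does not use the OTel SDK or PerfLog; the `track_rpc` helper and the in-memory stores are hypothetical, and the choice to count an errored call as not finished is an assumption, not something the task description specifies.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager

# In-memory stand-ins for the per-method counters and the histogram.
started = Counter()
finished = Counter()
errored = Counter()
durations_us = defaultdict(list)


@contextmanager
def track_rpc(method: str):
    """Record start, outcome, and duration (microseconds) for one call."""
    started[method] += 1
    begin = time.perf_counter()
    try:
        yield
    except Exception:
        errored[method] += 1  # assumption: errors are not counted as finished
        raise
    else:
        finished[method] += 1
    finally:
        elapsed_us = (time.perf_counter() - begin) * 1_000_000
        durations_us[method].append(elapsed_us)


# One successful call and one simulated failure.
with track_rpc("server_info"):
    pass

try:
    with track_rpc("fee"):
        raise ValueError("simulated failure")
except ValueError:
    pass
```

Each counter is keyed by method name, mirroring the `method` attribute label on the instruments listed above.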
@@ -154,11 +154,11 @@ These metrics serve multiple external consumer categories identified during rese
**What to do**:
- Register OTel instruments for PerfLog job counters:
- Counter: `rippled_job_queued_total{job_type="<name>"}` — jobs queued
- Counter: `rippled_job_started_total{job_type="<name>"}` — jobs started
- Counter: `rippled_job_finished_total{job_type="<name>"}` — jobs completed
- Histogram: `rippled_job_queued_duration_us{job_type="<name>"}` — time spent waiting in queue
- Histogram: `rippled_job_running_duration_us{job_type="<name>"}` — execution time distribution
- Counter: `xrpld_job_queued_total{job_type="<name>"}` — jobs queued
- Counter: `xrpld_job_started_total{job_type="<name>"}` — jobs started
- Counter: `xrpld_job_finished_total{job_type="<name>"}` — jobs completed
- Histogram: `xrpld_job_queued_duration_us{job_type="<name>"}` — time spent waiting in queue
- Histogram: `xrpld_job_running_duration_us{job_type="<name>"}` — execution time distribution
- Hook into PerfLog's existing job tracking alongside Task 9.4.
@@ -180,15 +180,15 @@ These metrics serve multiple external consumer categories identified during rese
**What to do**:
- Register OTel `ObservableGauge` callbacks for `CountedObject<T>` instance counts:
- `rippled_object_count{type="Transaction"}` — live Transaction objects
- `rippled_object_count{type="Ledger"}` — live Ledger objects
- `rippled_object_count{type="NodeObject"}` — live NodeObject instances
- `rippled_object_count{type="STTx"}` — serialized transaction objects
- `rippled_object_count{type="STLedgerEntry"}` — serialized ledger entries
- `rippled_object_count{type="InboundLedger"}` — ledgers being fetched
- `rippled_object_count{type="Pathfinder"}` — active pathfinding computations
- `rippled_object_count{type="PathRequest"}` — active path requests
- `rippled_object_count{type="HashRouterEntry"}` — hash router entries
- `xrpld_object_count{type="Transaction"}` — live Transaction objects
- `xrpld_object_count{type="Ledger"}` — live Ledger objects
- `xrpld_object_count{type="NodeObject"}` — live NodeObject instances
- `xrpld_object_count{type="STTx"}` — serialized transaction objects
- `xrpld_object_count{type="STLedgerEntry"}` — serialized ledger entries
- `xrpld_object_count{type="InboundLedger"}` — ledgers being fetched
- `xrpld_object_count{type="Pathfinder"}` — active pathfinding computations
- `xrpld_object_count{type="PathRequest"}` — active path requests
- `xrpld_object_count{type="HashRouterEntry"}` — hash router entries
- The `CountedObject` template already tracks these via atomic counters. The callback just reads the current counts.

@@ -1,11 +1,11 @@
# Telemetry Workload Tools
Synthetic workload generation and validation tools for rippled's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.
Synthetic workload generation and validation tools for xrpld's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.
## Quick Start
```bash
# Build rippled with telemetry enabled
# Build xrpld with telemetry enabled
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default
@@ -19,7 +19,7 @@ docker/telemetry/workload/run-full-validation.sh --cleanup
## Architecture
The validation suite runs a multi-node rippled cluster as local processes alongside
The validation suite runs a multi-node xrpld cluster as local processes alongside
a Docker Compose telemetry stack. The cluster exercises consensus, peer-to-peer
spans (proposals, validations), and all metric pipelines.
@@ -108,7 +108,7 @@ Custom `"weights"` override the default command/transaction distribution.
### run-full-validation.sh
Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node rippled cluster, generates load, and validates the results.
Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node xrpld cluster, generates load, and validates the results.
```bash
# Full validation with defaults (uses full-validation profile)
@@ -146,7 +146,7 @@ python3 workload_orchestrator.py --profile stress --report /tmp/report.json
### rpc_load_generator.py
Generates RPC traffic matching realistic production distribution. Uses
rippled's **native WebSocket command format** (`{"command": ...}`) with flat
xrpld's **native WebSocket command format** (`{"command": ...}`) with flat
parameters — the same format as `tx_submitter.py`.
- 40% health checks (server_info, fee)
@@ -172,7 +172,7 @@ python3 rpc_load_generator.py --endpoints ws://localhost:6006 \
### tx_submitter.py
Submits diverse transaction types to exercise the full span and metric surface.
Uses rippled's **native WebSocket command format** (`{"command": ...}`) rather
Uses xrpld's **native WebSocket command format** (`{"command": ...}`) rather
than JSON-RPC format. The response payload is inside the `"result"` key, with
`"status"` at the top level.
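
The request and response shapes described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `tx_submitter.py`: the helper names, the placeholder account `rEXAMPLE`, and the sample reply are hypothetical.

```python
import json


def build_ws_command(command: str, **params) -> str:
    """Serialize a native WebSocket request with flat parameters."""
    return json.dumps({"command": command, **params})


def extract_result(raw_response: str) -> dict:
    """Return the "result" payload after checking the top-level "status"."""
    response = json.loads(raw_response)
    if response.get("status") != "success":
        error = response.get("error", "unknown error")
        raise RuntimeError(f"request failed: {error}")
    return response["result"]


# Parameters sit next to "command", not inside a JSON-RPC "params" array.
request = build_ws_command("account_info", account="rEXAMPLE")

# Hypothetical reply: "status" at the top level, data inside "result".
reply = '{"status": "success", "type": "response", "result": {"ledger_index": 5}}'
result = extract_result(reply)
```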
@@ -310,7 +310,7 @@ Categories:
The validation runs as a GitHub Actions workflow (`.github/workflows/telemetry-validation.yml`):
- Triggered manually or on pushes to telemetry branches
- Builds rippled, starts the full stack, runs load, validates
- Builds xrpld, starts the full stack, runs load, validates
- Uploads reports as artifacts
- Posts summary to PR

@@ -16,11 +16,11 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191
**Symptoms:**
```
[FAIL] metric.statsd_gauges.rippled_LedgerMaster_Validated_Ledger_Age: 0 series
[FAIL] metric.statsd_counters.rippled_rpc_requests: 0 series
[FAIL] metric.statsd_histograms.rippled_rpc_time: 0 series
[FAIL] metric.overlay_traffic.rippled_total_Bytes_In: 0 series
[FAIL] metric.phase9_nodestore.rippled_nodestore_reads_total: 0 series
[FAIL] metric.statsd_gauges.xrpld_LedgerMaster_Validated_Ledger_Age: 0 series
[FAIL] metric.statsd_counters.xrpld_rpc_requests: 0 series
[FAIL] metric.statsd_histograms.xrpld_rpc_time: 0 series
[FAIL] metric.overlay_traffic.xrpld_total_Bytes_In: 0 series
[FAIL] metric.phase9_nodestore.xrpld_nodestore_reads_total: 0 series
... (25 total)
```
@@ -32,7 +32,7 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191
the validation harness configures xrpld nodes with `server=statsd`.
2. **Metric name mismatch:** The `expected_metrics.json` expects StatsD-style metric
names (e.g., `rippled_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`,
names (e.g., `xrpld_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`,
beast::insight emits OTLP metrics which may have different names/structure.
**Fix Options (pick one):**
@@ -127,24 +127,24 @@ individual checks), but the parent-child relationship isn't established.
**Symptoms:**
```
[FAIL] dashboard.rippled-statsd-node-health: HTTP 404
[FAIL] dashboard.rippled-statsd-network: HTTP 404
[FAIL] dashboard.rippled-statsd-rpc: HTTP 404
[FAIL] dashboard.rippled-statsd-overlay-detail: HTTP 404
[FAIL] dashboard.rippled-statsd-ledger-sync: HTTP 404
[FAIL] dashboard.xrpld-statsd-node-health: HTTP 404
[FAIL] dashboard.xrpld-statsd-network: HTTP 404
[FAIL] dashboard.xrpld-statsd-rpc: HTTP 404
[FAIL] dashboard.xrpld-statsd-overlay-detail: HTTP 404
[FAIL] dashboard.xrpld-statsd-ledger-sync: HTTP 404
```
**Root Cause:** Dashboard UIDs were renamed from `rippled-statsd-*` to `rippled-system-*`
**Root Cause:** Dashboard UIDs were renamed from `xrpld-statsd-*` to `xrpld-system-*`
but `expected_metrics.json` still references the old names.
**Actual UIDs in `docker/telemetry/grafana/dashboards/`:**
| Expected (in expected_metrics.json) | Actual (in dashboard JSON) |
|-------------------------------------|-------------------------------|
| `rippled-statsd-node-health` | `rippled-system-node-health` |
| `rippled-statsd-network` | `rippled-system-network` |
| `rippled-statsd-rpc` | `rippled-system-rpc` |
| `rippled-statsd-overlay-detail` | `rippled-system-overlay-detail` |
| `rippled-statsd-ledger-sync` | `rippled-system-ledger-sync` |
| `xrpld-statsd-node-health` | `xrpld-system-node-health` |
| `xrpld-statsd-network` | `xrpld-system-network` |
| `xrpld-statsd-rpc` | `xrpld-system-rpc` |
| `xrpld-statsd-overlay-detail` | `xrpld-system-overlay-detail` |
| `xrpld-statsd-ledger-sync` | `xrpld-system-ledger-sync` |
**Fix:** Update the 5 UIDs in the `grafana_dashboards.uids[]` array of `expected_metrics.json`.
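
One way to apply this fix is sketched below. The `rename_statsd_uids` helper and the sample config are hypothetical, and the config shape is assumed only from the `grafana_dashboards.uids[]` path named above; the real `expected_metrics.json` may contain other keys, which the sketch leaves untouched.

```python
def rename_statsd_uids(config: dict) -> dict:
    """Return a copy of the config with `-statsd-` UIDs renamed to `-system-`."""
    dashboards = config["grafana_dashboards"]
    renamed = [uid.replace("-statsd-", "-system-") for uid in dashboards["uids"]]
    return {**config, "grafana_dashboards": {**dashboards, "uids": renamed}}


# Hypothetical excerpt of the expected-metrics config.
sample_config = {
    "grafana_dashboards": {
        "uids": [
            "xrpld-rpc-perf",
            "xrpld-statsd-node-health",
            "xrpld-statsd-rpc",
        ]
    }
}

fixed = rename_statsd_uids(sample_config)
```

UIDs without the `-statsd-` segment, such as `xrpld-rpc-perf`, pass through unchanged.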