diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index f3d581cc37..00a71a25c6 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -846,7 +846,7 @@ flowchart LR ### Key Implementation Details -- **Transaction submitter and RPC load generator** both use rippled's native WebSocket command format (`{"command": ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level. +- **Transaction submitter and RPC load generator** both use xrpld's native WebSocket command format (`{"command": ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level. - **Node config** requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics. - **Metric validation** uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series. - **StatsD gauge fix**: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus. 
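The native WebSocket command format described in the first bullet of this hunk can be illustrated with a small sketch. It is not taken from `tx_submitter.py` or `rpc_load_generator.py`; the helper names `build_ws_command` and `parse_ws_response` are hypothetical, and the sample response is hand-written for illustration. It only shows the shape difference from JSON-RPC: flat parameters next to `"command"`, and `"status"` at the top level beside `"result"`.

```python
import json


def build_ws_command(command: str, request_id: int = 1, **params) -> str:
    """Serialize a native WebSocket command with flat (non-nested) parameters.

    Hypothetical helper: JSON-RPC would instead wrap parameters as
    {"method": ..., "params": [{...}]}.
    """
    message = {"id": request_id, "command": command}
    message.update(params)  # parameters sit beside "command", not under "params"
    return json.dumps(message)


def parse_ws_response(raw: str) -> dict:
    """Return the "result" payload after checking the top-level "status"."""
    response = json.loads(raw)
    if response.get("status") != "success":
        raise RuntimeError(f"command failed: {response.get('error', 'unknown error')}")
    return response.get("result", {})


# Hand-written sample response (illustrative values, not real server output).
sample_response = json.dumps(
    {"id": 1, "status": "success", "type": "response", "result": {"ledger_index": 42}}
)

request = build_ws_command("ledger", ledger_index="validated")
result = parse_ws_response(sample_response)
```

A load generator built this way would send `request` over the WebSocket connection and pass each received frame to `parse_ws_response`.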
@@ -871,13 +871,13 @@ See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown The validation suite (`validate_telemetry.py`) runs exactly 71 checks, broken down as: -- **1 service registration** — `rippled` exists in Tempo +- **1 service registration** — `xrpld` exists in Tempo - **17 span existence** — `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive` - **14 span attribute** — required attributes on the 14 spans that define them (22 unique attributes total) - **2 span hierarchies** — `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread) - **1 span duration bounds** — all spans > 0 and < 60 s - **26 metric existence** — 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`) -- **10 dashboard loads** — `rippled-rpc-perf`, `rippled-transactions`, `rippled-consensus`, `rippled-ledger-ops`, `rippled-peer-net`, `rippled-system-node-health`, `rippled-system-network`, `rippled-system-rpc`, `rippled-system-overlay-detail`, `rippled-system-ledger-sync` +- **10 dashboard loads** — `xrpld-rpc-perf`, `xrpld-transactions`, `xrpld-consensus`, 
`xrpld-ledger-ops`, `xrpld-peer-net`, `xrpld-system-node-health`, `xrpld-system-network`, `xrpld-system-rpc`, `xrpld-system-overlay-detail`, `xrpld-system-ledger-sync` See [Phase10_taskList.md](./Phase10_taskList.md) for the full numbered check-by-check enumeration. diff --git a/OpenTelemetryPlan/Phase11_taskList.md b/OpenTelemetryPlan/Phase11_taskList.md index 41ad1cd165..5124b14edb 100644 --- a/OpenTelemetryPlan/Phase11_taskList.md +++ b/OpenTelemetryPlan/Phase11_taskList.md @@ -446,40 +446,40 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h > **Upstream**: Phase 7 Tasks 7.9-7.16 (metrics), Phase 9 Tasks 9.11-9.13 (dashboards). > **Downstream**: None — terminal task in the parity chain. -**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `rippled_*` internal metrics. +**Objective**: Add Grafana alerting rules for the Phase 7+ parity metrics (validation agreement, validator health, peer quality, state tracking, ledger economy). These complement Task 11.8's `xrpl_*` alerts by covering the `xrpld_*` internal metrics. 
**Critical Group** (8 rules, eval interval 10s): -| Rule | Condition | For | -| ------------------- | --------------------------------------------------------------- | --- | -| Agreement Below 90% | `rippled_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s | -| Not Proposing | `rippled_state_tracking{metric="state_value"} < 6` | 10s | -| Unhealthy State | `rippled_state_tracking{metric="state_value"} < 4` | 10s | -| Amendment Blocked | `rippled_validator_health{metric="amendment_blocked"} == 1` | 1m | -| UNL Expiring | `rippled_validator_health{metric="unl_expiry_days"} < 14` | 1h | -| High IO Latency | `histogram_quantile(0.95, rippled_ios_latency_bucket) > 50` | 1m | -| High Load Factor | `rippled_load_factor_metrics{metric="load_factor"} > 1000` | 1m | -| Peer Count Critical | `rippled_server_info{metric="peers"} < 5` | 1m | +| Rule | Condition | For | +| ------------------- | ------------------------------------------------------------- | --- | +| Agreement Below 90% | `xrpld_validation_agreement{metric="agreement_pct_24h"} < 90` | 30s | +| Not Proposing | `xrpld_state_tracking{metric="state_value"} < 6` | 10s | +| Unhealthy State | `xrpld_state_tracking{metric="state_value"} < 4` | 10s | +| Amendment Blocked | `xrpld_validator_health{metric="amendment_blocked"} == 1` | 1m | +| UNL Expiring | `xrpld_validator_health{metric="unl_expiry_days"} < 14` | 1h | +| High IO Latency | `histogram_quantile(0.95, xrpld_ios_latency_bucket) > 50` | 1m | +| High Load Factor | `xrpld_load_factor_metrics{metric="load_factor"} > 1000` | 1m | +| Peer Count Critical | `xrpld_server_info{metric="peers"} < 5` | 1m | **Network Group** (3 rules, eval interval 10s): -| Rule | Condition | For | -| ------------------------- | ------------------------------------------------------------------- | --- | -| Peer Drop >10% | `delta(rippled_server_info{metric="peers"}[30s]) / ... 
* 100 < -10` | 30s | -| Peer Drop >30% | Same formula, threshold -30 | 30s | -| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m | +| Rule | Condition | For | +| ------------------------- | ----------------------------------------------------------------- | --- | +| Peer Drop >10% | `delta(xrpld_server_info{metric="peers"}[30s]) / ... * 100 < -10` | 30s | +| Peer Drop >30% | Same formula, threshold -30 | 30s | +| P90 Latency + Disconnects | `peer_latency_p90_ms > 500 AND rate(disconnects) > 0` | 2m | **Performance Group** (7 rules, eval interval 10s): -| Rule | Condition | For | -| ------------------- | -------------------------------------------------------------- | --- | -| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m | -| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m | -| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m | -| Job Queue Overflow | `rate(rippled_jq_trans_overflow_total[5m]) > 0` | 1m | -| Upgrade Recommended | `rippled_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m | -| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m | -| Stale Ledger | `rippled_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m | +| Rule | Condition | For | +| ------------------- | ------------------------------------------------------------ | --- | +| CPU High | Per-core CPU > 80% (requires node_exporter) | 2m | +| Memory Critical | Memory usage > 90% (requires node_exporter) | 1m | +| Disk Warning | Disk usage > 85% (requires node_exporter) | 2m | +| Job Queue Overflow | `rate(xrpld_jq_trans_overflow_total[5m]) > 0` | 1m | +| Upgrade Recommended | `xrpld_peer_quality{metric="peers_higher_version_pct"} > 60` | 1m | +| TX Rate Drop | Transaction rate dropped > 50% in 5m window | 5m | +| Stale Ledger | `xrpld_ledger_economy{metric="ledger_age_seconds"} > 30` | 1m | **Notification channel templates**: Email/SMTP, Discord, Slack, PagerDuty. 
@@ -507,13 +507,13 @@ This phase addresses the cross-cutting gap identified during research: **xrpld h **Use case**: Real-time state panels (server state, ledger age, peer count) where 10-15s latency is too slow for operational dashboards. -**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to rippled would only be needed if sub-second metric freshness is required from the internal metrics pipeline. +**Decision**: Document as a future option, not implement now. The current 10s interval is acceptable for v1. The external dashboard achieves 2-5s freshness by polling RPC directly, which is what the Phase 11 receiver already does. Adding a separate scrape endpoint to xrpld would only be needed if sub-second metric freshness is required from the internal metrics pipeline. **What to document**: - Architecture comparison: OTLP pipeline (10-15s) vs. direct scrape (2-5s) vs. 
push gateway - When to consider: operator feedback indicating 10s is insufficient for alerting SLOs -- How to implement if needed: add `/metrics` HTTP endpoint to rippled with Prometheus client library +- How to implement if needed: add `/metrics` HTTP endpoint to xrpld with Prometheus client library - Trade-offs: additional port, additional dependency, duplication with OTLP metrics **Key files**: diff --git a/OpenTelemetryPlan/Phase9_taskList.md b/OpenTelemetryPlan/Phase9_taskList.md index 07af1ddef3..b43c1270e9 100644 --- a/OpenTelemetryPlan/Phase9_taskList.md +++ b/OpenTelemetryPlan/Phase9_taskList.md @@ -127,10 +127,10 @@ These metrics serve multiple external consumer categories identified during rese **What to do**: - Register OTel instruments for PerfLog RPC counters (from `PerfLogImp.cpp` line ~63): - - Counter: `rippled_rpc_method_started_total{method=""}` — calls started - - Counter: `rippled_rpc_method_finished_total{method=""}` — calls completed - - Counter: `rippled_rpc_method_errored_total{method=""}` — calls errored - - Histogram: `rippled_rpc_method_duration_us{method=""}` — execution time distribution + - Counter: `xrpld_rpc_method_started_total{method=""}` — calls started + - Counter: `xrpld_rpc_method_finished_total{method=""}` — calls completed + - Counter: `xrpld_rpc_method_errored_total{method=""}` — calls errored + - Histogram: `xrpld_rpc_method_duration_us{method=""}` — execution time distribution - Use OTel `Counter` and `Histogram` instruments with `method` attribute label. 
@@ -154,11 +154,11 @@ These metrics serve multiple external consumer categories identified during rese **What to do**: - Register OTel instruments for PerfLog job counters: - - Counter: `rippled_job_queued_total{job_type=""}` — jobs queued - - Counter: `rippled_job_started_total{job_type=""}` — jobs started - - Counter: `rippled_job_finished_total{job_type=""}` — jobs completed - - Histogram: `rippled_job_queued_duration_us{job_type=""}` — time spent waiting in queue - - Histogram: `rippled_job_running_duration_us{job_type=""}` — execution time distribution + - Counter: `xrpld_job_queued_total{job_type=""}` — jobs queued + - Counter: `xrpld_job_started_total{job_type=""}` — jobs started + - Counter: `xrpld_job_finished_total{job_type=""}` — jobs completed + - Histogram: `xrpld_job_queued_duration_us{job_type=""}` — time spent waiting in queue + - Histogram: `xrpld_job_running_duration_us{job_type=""}` — execution time distribution - Hook into PerfLog's existing job tracking alongside Task 9.4. 
@@ -180,15 +180,15 @@ These metrics serve multiple external consumer categories identified during rese **What to do**: - Register OTel `ObservableGauge` callbacks for `CountedObject` instance counts: - - `rippled_object_count{type="Transaction"}` — live Transaction objects - - `rippled_object_count{type="Ledger"}` — live Ledger objects - - `rippled_object_count{type="NodeObject"}` — live NodeObject instances - - `rippled_object_count{type="STTx"}` — serialized transaction objects - - `rippled_object_count{type="STLedgerEntry"}` — serialized ledger entries - - `rippled_object_count{type="InboundLedger"}` — ledgers being fetched - - `rippled_object_count{type="Pathfinder"}` — active pathfinding computations - - `rippled_object_count{type="PathRequest"}` — active path requests - - `rippled_object_count{type="HashRouterEntry"}` — hash router entries + - `xrpld_object_count{type="Transaction"}` — live Transaction objects + - `xrpld_object_count{type="Ledger"}` — live Ledger objects + - `xrpld_object_count{type="NodeObject"}` — live NodeObject instances + - `xrpld_object_count{type="STTx"}` — serialized transaction objects + - `xrpld_object_count{type="STLedgerEntry"}` — serialized ledger entries + - `xrpld_object_count{type="InboundLedger"}` — ledgers being fetched + - `xrpld_object_count{type="Pathfinder"}` — active pathfinding computations + - `xrpld_object_count{type="PathRequest"}` — active path requests + - `xrpld_object_count{type="HashRouterEntry"}` — hash router entries - The `CountedObject` template already tracks these via atomic counters. The callback just reads the current counts. diff --git a/docker/telemetry/workload/README.md b/docker/telemetry/workload/README.md index f1aa1d2720..5f28cf42e1 100644 --- a/docker/telemetry/workload/README.md +++ b/docker/telemetry/workload/README.md @@ -1,11 +1,11 @@ # Telemetry Workload Tools -Synthetic workload generation and validation tools for rippled's OpenTelemetry telemetry stack. 
These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load. +Synthetic workload generation and validation tools for xrpld's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load. ## Quick Start ```bash -# Build rippled with telemetry enabled +# Build xrpld with telemetry enabled conan install . --build=missing -o telemetry=True cmake --preset default -Dtelemetry=ON cmake --build --preset default @@ -19,7 +19,7 @@ docker/telemetry/workload/run-full-validation.sh --cleanup ## Architecture -The validation suite runs a multi-node rippled cluster as local processes alongside +The validation suite runs a multi-node xrpld cluster as local processes alongside a Docker Compose telemetry stack. The cluster exercises consensus, peer-to-peer spans (proposals, validations), and all metric pipelines. @@ -108,7 +108,7 @@ Custom `"weights"` override the default command/transaction distribution. ### run-full-validation.sh -Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node rippled cluster, generates load, and validates the results. +Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node xrpld cluster, generates load, and validates the results. ```bash # Full validation with defaults (uses full-validation profile) @@ -146,7 +146,7 @@ python3 workload_orchestrator.py --profile stress --report /tmp/report.json ### rpc_load_generator.py Generates RPC traffic matching realistic production distribution. Uses -rippled's **native WebSocket command format** (`{"command": ...}`) with flat +xrpld's **native WebSocket command format** (`{"command": ...}`) with flat parameters — the same format as `tx_submitter.py`. 
- 40% health checks (server_info, fee) @@ -172,7 +172,7 @@ python3 rpc_load_generator.py --endpoints ws://localhost:6006 \ ### tx_submitter.py Submits diverse transaction types to exercise the full span and metric surface. -Uses rippled's **native WebSocket command format** (`{"command": ...}`) rather +Uses xrpld's **native WebSocket command format** (`{"command": ...}`) rather than JSON-RPC format. The response payload is inside the `"result"` key, with `"status"` at the top level. @@ -310,7 +310,7 @@ Categories: The validation runs as a GitHub Actions workflow (`.github/workflows/telemetry-validation.yml`): - Triggered manually or on pushes to telemetry branches -- Builds rippled, starts the full stack, runs load, validates +- Builds xrpld, starts the full stack, runs load, validates - Uploads reports as artifacts - Posts summary to PR diff --git a/tasks/fix-validation-checks.md b/tasks/fix-validation-checks.md index 44cacc102a..bdfd58abc7 100644 --- a/tasks/fix-validation-checks.md +++ b/tasks/fix-validation-checks.md @@ -16,11 +16,11 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191 **Symptoms:** ``` -[FAIL] metric.statsd_gauges.rippled_LedgerMaster_Validated_Ledger_Age: 0 series -[FAIL] metric.statsd_counters.rippled_rpc_requests: 0 series -[FAIL] metric.statsd_histograms.rippled_rpc_time: 0 series -[FAIL] metric.overlay_traffic.rippled_total_Bytes_In: 0 series -[FAIL] metric.phase9_nodestore.rippled_nodestore_reads_total: 0 series +[FAIL] metric.statsd_gauges.xrpld_LedgerMaster_Validated_Ledger_Age: 0 series +[FAIL] metric.statsd_counters.xrpld_rpc_requests: 0 series +[FAIL] metric.statsd_histograms.xrpld_rpc_time: 0 series +[FAIL] metric.overlay_traffic.xrpld_total_Bytes_In: 0 series +[FAIL] metric.phase9_nodestore.xrpld_nodestore_reads_total: 0 series ... (25 total) ``` @@ -32,7 +32,7 @@ CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191 the validation harness configures xrpld nodes with `server=statsd`. 2. 
**Metric name mismatch:** The `expected_metrics.json` expects StatsD-style metric - names (e.g., `rippled_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`, + names (e.g., `xrpld_LedgerMaster_Validated_Ledger_Age`). When using `server=otel`, beast::insight emits OTLP metrics which may have different names/structure. **Fix Options (pick one):** @@ -127,24 +127,24 @@ individual checks), but the parent-child relationship isn't established. **Symptoms:** ``` -[FAIL] dashboard.rippled-statsd-node-health: HTTP 404 -[FAIL] dashboard.rippled-statsd-network: HTTP 404 -[FAIL] dashboard.rippled-statsd-rpc: HTTP 404 -[FAIL] dashboard.rippled-statsd-overlay-detail: HTTP 404 -[FAIL] dashboard.rippled-statsd-ledger-sync: HTTP 404 +[FAIL] dashboard.xrpld-statsd-node-health: HTTP 404 +[FAIL] dashboard.xrpld-statsd-network: HTTP 404 +[FAIL] dashboard.xrpld-statsd-rpc: HTTP 404 +[FAIL] dashboard.xrpld-statsd-overlay-detail: HTTP 404 +[FAIL] dashboard.xrpld-statsd-ledger-sync: HTTP 404 ``` -**Root Cause:** Dashboard UIDs were renamed from `rippled-statsd-*` to `rippled-system-*` +**Root Cause:** Dashboard UIDs were renamed from `xrpld-statsd-*` to `xrpld-system-*` but `expected_metrics.json` still references the old names. 
**Actual UIDs in `docker/telemetry/grafana/dashboards/`:** | Expected (in expected_metrics.json) | Actual (in dashboard JSON) | |-------------------------------------|-------------------------------| -| `rippled-statsd-node-health` | `rippled-system-node-health` | -| `rippled-statsd-network` | `rippled-system-network` | -| `rippled-statsd-rpc` | `rippled-system-rpc` | -| `rippled-statsd-overlay-detail` | `rippled-system-overlay-detail` | -| `rippled-statsd-ledger-sync` | `rippled-system-ledger-sync` | +| `xrpld-statsd-node-health` | `xrpld-system-node-health` | +| `xrpld-statsd-network` | `xrpld-system-network` | +| `xrpld-statsd-rpc` | `xrpld-system-rpc` | +| `xrpld-statsd-overlay-detail` | `xrpld-system-overlay-detail` | +| `xrpld-statsd-ledger-sync` | `xrpld-system-ledger-sync` | **Fix:** Update the 5 UIDs in `expected_metrics.json` → `grafana_dashboards.uids[]`.
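The UID update described in the fix above could be scripted roughly as follows. This is a minimal sketch, assuming `expected_metrics.json` holds the UIDs as a list under `grafana_dashboards.uids` (as the path in the fix suggests); the function name `rename_dashboard_uids` and the sample data are hypothetical.

```python
import copy

OLD_PREFIX = "xrpld-statsd-"
NEW_PREFIX = "xrpld-system-"


def rename_dashboard_uids(config: dict) -> dict:
    """Return a copy of config with stale dashboard UID prefixes replaced.

    Assumes the layout {"grafana_dashboards": {"uids": [...]}}; other keys
    are copied through unchanged.
    """
    updated = copy.deepcopy(config)
    dashboards = updated.get("grafana_dashboards")
    if isinstance(dashboards, dict) and isinstance(dashboards.get("uids"), list):
        dashboards["uids"] = [
            NEW_PREFIX + uid[len(OLD_PREFIX):] if uid.startswith(OLD_PREFIX) else uid
            for uid in dashboards["uids"]
        ]
    return updated


# Hypothetical excerpt of expected_metrics.json, loaded as a dict.
sample = {
    "grafana_dashboards": {
        "uids": ["xrpld-rpc-perf", "xrpld-statsd-node-health", "xrpld-statsd-rpc"]
    }
}
fixed = rename_dashboard_uids(sample)
```

In practice the dict would be read with `json.load`, passed through the function, and written back with `json.dump`; the input is left unmodified so the change can be diffed before saving.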