rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-06-07 02:36:47 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	fd1c8c6060	fix(telemetry): resolve Phase 10 validation failures surfaced by first full CI run The print-env CI fix let the Telemetry Stack Validation job build and run the workload harness end-to-end for the first time. It reported 129/136 checks passing; this commit fixes the 7 real failures plus a latent regression-gate bug. Validation-suite fixes (verified against the CI run's actual emission + live node): - expected_metrics.json: the beast::insight job-depth gauge is `xrpld_jobq_job_count`, not `xrpld_job_count` (the latter is a Phase 9 OTel counter). Reverted the prior rename. Removed the statsd_histograms block (`xrpld_rpc_time`/`xrpld_rpc_size`): these RPC timers do not emit under the WS workload (0 series in CI). - expected_spans.json: `tx_status` is only set on suppressed/known-bad receives, so it is no longer a required attribute of every `tx.receive`. Marked `pathfind.compute` and `pathfind.discover` optional and the `pathfind.request -> pathfind.compute` hierarchy as skip — the self-to-self XRP probe returns before computing paths in a fresh cluster with no liquidity, so only `pathfind.request` fires. Regression-gate bug (telemetry-validation.yml "Print regression summary"): - `jq -e` exits non-zero when its filter result is boolean false — the normal case for a populated (non-placeholder) baseline — which was misreported as "Failed to parse baseline JSON" and failed the job. Dropped `-e` (kept `-r`) so a non-zero exit genuinely means malformed JSON. The optional-span handling and regression comparison both worked correctly in the CI run (txq.* / pathfind.update_all skipped-when-absent, 0 regressions detected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:26:30 +01:00
Pratik Mankawde	cb9fce6890	fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI The Phase 10 validation harness had drifted from the code's recording surface and the telemetry-validation CI job was failing before it could build. CI fix (telemetry-validation.yml): - Replace nonexistent local action ./.github/actions/print-env with the remote XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this). - Sync prepare-runner and upload-artifact action SHAs to the canonical workflow. Recording-surface reconciliation (docker/telemetry/workload/): - Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id, ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...). Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared constant), while consensus.validation.send uses bare ledger_hash. - Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq); ledger.build carries ledger_seq/close_* (not tx_count/tx_failed). - Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop the never-emitted duration_ms; rebuild the parent-child map accordingly. - Add the new spans the code emits: apply-pipeline stage spans (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq., consensus sub-spans (round/establish/update_positions/check/phase.open), ledger.acquire, grpc., pathfind.. Conditional spans are marked optional so they are skipped (not failed) when the workload does not exercise them. - validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span attrs); add optional-span handling that skips missing optional spans while still validating attributes when present. - expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics, xrpld_job_count, the 15 on-disk xrpld- dashboard UIDs, and the real bare spanmetrics dimension labels. - regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message. Metrics pipeline fix: - Switch node [insight] config from server=statsd/prefix=rippled to server=otel + /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh, xrpld-validator.cfg.template, benchmark.sh and the workload compose. The collector has no StatsD receiver, so system metrics only reach Prometheus over OTLP. Synthetic load for new spans: - Add ripple_path_find to the RPC load generator (drives pathfind.* spans). - Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.). All facts verified against the SpanNames.h headers and a live xrpld node + collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type], 279 xrpld_ Prometheus metrics and zero rippled_). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 17:08:58 +01:00
Pratik Mankawde	592e546f82	fix(telemetry): align Phase 10 workload configs with xrpld_ metric prefix Phase 10's workload validation configs (expected_metrics.json, regression-metrics.json, validate_telemetry.py) queried the MetricsRegistry metrics under the rippled_ prefix, but MetricsRegistry emits them as xrpld_ (see MetricsRegistry.cpp). On a live run the workload validator reported every MetricsRegistry metric as missing, masking genuine regressions. Rename the following to xrpld_ across the workload validator, expected-metrics manifest, and regression-metrics template: - nodestore_state, cache_metrics, txq_metrics, load_factor_metrics, object_count - rpc_method_started_total / _finished_total / _errored_total / _duration_us - job_queued_total / _started_total / _finished_total / _queued_duration_us_bucket / _running_duration_us_bucket - peer_quality, server_info, validator_health, ledger_economy, db_metrics, complete_ledgers, build_info, state_tracking, storage_detail - ledgers_closed_total, validations_sent_total, validations_checked_total, state_changes_total - validation_agreement, validation_agreements_total, validation_missed_total Mirrors the phase-9 fix in commit `5601615952`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-13 15:01:13 +01:00
Pratik Mankawde	ff1502f939	feat(telemetry): add workload orchestrator with phased load profiles Add a profile-driven workload orchestrator that executes sequential load phases with configurable RPC rates and TX throughput. Three profiles: full-validation (6 phases covering all 18 dashboards), quick-smoke (CI), and stress (benchmarking). Fix 10 validation failures: correct Phase 9 metric prefixes, relax peer latency bounds for localhost clusters, and allow sub-microsecond span durations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	e63ca4c495	fix(telemetry): fix dashboard UID and add parity attributes to expected_spans - Remove duplicate 'system-node-health' UID from expected_metrics.json (already covered by 'rippled-system-node-health') - Add parity span attributes to expected_spans.json: node health on rpc.command.*, validation hash/full on consensus.validation.send, quorum/proposers on consensus.accept, validation hash/full on peer.validation.receive Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	711ae43174	feat(telemetry): add external dashboard parity validation checks (Task 10.8) Add ~28 validation checks for external dashboard parity: - 8 span attribute checks (server_info, tx.receive, consensus, peer spans) - 13 metric existence checks (validation agreement, validator health, peer quality, ledger economy, state tracking, counters, storage) - 3 dashboard load checks (validator-health, peer-quality, system-node-health) - 4 value sanity checks (agreement %, UNL expiry, latency, state value) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	5de8c520d1	Phase 10: Workload validation - synthetic load generation and telemetry checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00

7 Commits