rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-27 00:50:45 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	d83cb0bdb3	fix(telemetry): refresh regression baseline + widen bucket-noise thresholds With validation now passing 133/133, the only remaining job failure was the regression gate flagging 4 timing "regressions". Two compounding causes: 1. Stale baseline: the committed baseline was captured (2026-04-24) under the old, lighter workload — before the new txq-burst phase (60 TPS) existed. The heavier per-ledger work genuinely raises ledger.build / tx.apply / ledger.validate / acceptLedger timings, so every run regressed against it. Refreshed the baseline from the latest CI-measured timings (same workload). 2. Histogram quantization: SpanMetrics latency buckets are [1,5,10,25,...]ms, so a sub-millisecond quantile near a low-end boundary can jump a full bucket (1ms->5ms) between runs with no real change. The old absolute bounds (2-5ms) were narrower than one bucket width, so that jitter tripped the gate. Widened the default span bounds to 10-15ms (~2 low-end buckets) and pct to 50%, and the job_queue running bound to 20ms, to tolerate quantization noise while still catching genuine multi-bucket regressions. The consensus.* overrides (tight pct, large abs) are unchanged. The refreshed baseline also picks up real rpc.ws_message timings (previously null under the phantom rpc.request key). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:58:07 +01:00
Pratik Mankawde	cb9fce6890	fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI The Phase 10 validation harness had drifted from the code's recording surface and the telemetry-validation CI job was failing before it could build. CI fix (telemetry-validation.yml): - Replace nonexistent local action ./.github/actions/print-env with the remote XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this). - Sync prepare-runner and upload-artifact action SHAs to the canonical workflow. Recording-surface reconciliation (docker/telemetry/workload/): - Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id, ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...). Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared constant), while consensus.validation.send uses bare ledger_hash. - Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq); ledger.build carries ledger_seq/close_* (not tx_count/tx_failed). - Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop the never-emitted duration_ms; rebuild the parent-child map accordingly. - Add the new spans the code emits: apply-pipeline stage spans (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq., consensus sub-spans (round/establish/update_positions/check/phase.open), ledger.acquire, grpc., pathfind.. Conditional spans are marked optional so they are skipped (not failed) when the workload does not exercise them. - validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span attrs); add optional-span handling that skips missing optional spans while still validating attributes when present. - expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics, xrpld_job_count, the 15 on-disk xrpld- dashboard UIDs, and the real bare spanmetrics dimension labels. - regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message. Metrics pipeline fix: - Switch node [insight] config from server=statsd/prefix=rippled to server=otel + /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh, xrpld-validator.cfg.template, benchmark.sh and the workload compose. The collector has no StatsD receiver, so system metrics only reach Prometheus over OTLP. Synthetic load for new spans: - Add ripple_path_find to the RPC load generator (drives pathfind.* spans). - Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.). All facts verified against the SpanNames.h headers and a live xrpld node + collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type], 279 xrpld_ Prometheus metrics and zero rippled_). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 17:08:58 +01:00
Pratik Mankawde	dc13e9d680	fix: populate baseline from CI run, remove dead rpc_methods metrics Populate baselines/baseline-timings.json from the green CI run (24906110133, commit `f11ebc1253`). 25/31 metrics have non-null values; 6 span.rpc.* are null due to sparse data in the 3m window. Remove the rpc_methods section from regression-metrics.json and its thresholds. rippled_rpc_method_duration_us_bucket is never populated because PerfLogImp::rpcEnd never calls MetricsRegistry::recordRpcFinished — only recordRpcStarted is wired up (Phase 9 instrumentation gap). The span-based rpc.request/rpc.process metrics via spanmetrics already cover RPC latency.	2026-04-24 20:08:52 +01:00
Pratik Mankawde	577d1f8a21	fix: address review findings in regression gate - capture_timings.py: fail when captured/total ratio < 50% (--min-capture-ratio). Prevents silent pass on unreachable Prometheus. - run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so the final exit code reflects it. Update exit code docs in header. - compare_to_baseline.py: extract _skip_delta helper to bring compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound resolution (use explicit None check instead of `or`). Remove dead variable override_prefix_key. - prom_queries.py: extract _build_simple_entries and _build_job_entries to bring build_query_plan under 80 lines. Fix module docstring return type example. Use aiohttp.ClientTimeout instead of bare int. - telemetry-validation.yml: add set -euo pipefail to regression summary step; guard jq calls with -e flag and fallback; fail on missing baseline file; emit ::warning annotation when timings.json missing. - baselines/README.md: document the placeholder field.	2026-04-24 19:36:15 +01:00
Pratik Mankawde	df79d5e74b	feat: add OTel-driven regression gate for Phase 10 telemetry validation Captures per-span / per-RPC / per-job timings from Prometheus after the workload run and diffs them against a committed baseline. Regression requires breaching both a percentage and an absolute bound, tolerating small-value noise. When the baseline is a placeholder, the comparator emits the captured JSON in the exact schema for one-time paste into baselines/baseline-timings.json, and the CI Step Summary surfaces that block for the reviewer. Scope: gate only — automated baseline persistence, benchmark.sh PromQL migration, and the historical trend dashboard remain follow-ups.	2026-04-24 18:53:44 +01:00

5 Commits