With validation now passing 133/133, the only remaining job failure was the
regression gate flagging 4 timing "regressions". Two compounding causes:
1. Stale baseline: the committed baseline was captured (2026-04-24) under the
old, lighter workload — before the new txq-burst phase (60 TPS) existed. The
heavier per-ledger work genuinely raises ledger.build / tx.apply /
ledger.validate / acceptLedger timings, so every run regressed against it.
Refreshed the baseline from the latest CI-measured timings (same workload).
2. Histogram quantization: SpanMetrics latency buckets are
[1,5,10,25,...]ms, so a sub-millisecond quantile near a low-end boundary can
jump a full bucket (1ms->5ms) between runs with no real change. The old
absolute bounds (2-5ms) were narrower than one bucket width, so that jitter
tripped the gate. Widened the default span bounds to 10-15ms (~2 low-end
buckets) and pct to 50%, and the job_queue running bound to 20ms, to tolerate
quantization noise while still catching genuine multi-bucket regressions. The
consensus.* overrides (tight pct, large abs) are unchanged.
The refreshed baseline also picks up real rpc.ws_message timings (previously null
under the phantom rpc.request key).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Phase 10 validation harness had drifted from the code's recording surface
and the telemetry-validation CI job was failing before it could build.
CI fix (telemetry-validation.yml):
- Replace nonexistent local action ./.github/actions/print-env with the remote
XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this).
- Sync prepare-runner and upload-artifact action SHAs to the canonical workflow.
Recording-surface reconciliation (docker/telemetry/workload/):
- Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore
form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id,
ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...).
Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared
constant), while consensus.validation.send uses bare ledger_hash.
- Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq);
ledger.build carries ledger_seq/close_* (not tx_count/tx_failed).
- Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop
the never-emitted duration_ms; rebuild the parent-child map accordingly.
- Add the new spans the code emits: apply-pipeline stage spans
(tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq.*,
consensus sub-spans (round/establish/update_positions/check/phase.open),
ledger.acquire, grpc.*, pathfind.*. Conditional spans are marked optional so
they are skipped (not failed) when the workload does not exercise them.
- validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix
PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span
attrs); add optional-span handling that skips missing optional spans while still
validating attributes when present.
- expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics,
xrpld_job_count, the 15 on-disk xrpld-* dashboard UIDs, and the real bare
spanmetrics dimension labels.
- regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message.
Metrics pipeline fix:
- Switch node [insight] config from server=statsd/prefix=rippled to server=otel +
/v1/metrics endpoint + prefix=xrpld across run-full-validation.sh,
xrpld-validator.cfg.template, benchmark.sh and the workload compose. The
collector has no StatsD receiver, so system metrics only reach Prometheus over
OTLP.
Synthetic load for new spans:
- Add ripple_path_find to the RPC load generator (drives pathfind.* spans).
- Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.*).
All facts verified against the *SpanNames.h headers and a live xrpld node +
collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type],
279 xrpld_ Prometheus metrics and zero rippled_).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Populate baselines/baseline-timings.json from the green CI run
(24906110133, commit f11ebc1253). 25/31 metrics have non-null values;
6 span.rpc.* are null due to sparse data in the 3m window.
Remove the rpc_methods section from regression-metrics.json and its
thresholds. rippled_rpc_method_duration_us_bucket is never populated
because PerfLogImp::rpcEnd never calls MetricsRegistry::recordRpcFinished
— only recordRpcStarted is wired up (Phase 9 instrumentation gap).
The span-based rpc.request/rpc.process metrics via spanmetrics already
cover RPC latency.
- capture_timings.py: fail when captured/total ratio < 50%
(--min-capture-ratio). Prevents silent pass on unreachable Prometheus.
- run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so
the final exit code reflects it. Update exit code docs in header.
- compare_to_baseline.py: extract _skip_delta helper to bring
compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound
resolution (use explicit None check instead of `or`). Remove dead
variable override_prefix_key.
- prom_queries.py: extract _build_simple_entries and _build_job_entries
to bring build_query_plan under 80 lines. Fix module docstring return
type example. Use aiohttp.ClientTimeout instead of bare int.
- telemetry-validation.yml: add set -euo pipefail to regression summary
step; guard jq calls with -e flag and fallback; fail on missing
baseline file; emit ::warning annotation when timings.json missing.
- baselines/README.md: document the placeholder field.
Captures per-span / per-RPC / per-job timings from Prometheus after the
workload run and diffs them against a committed baseline. Regression
requires breaching both a percentage and an absolute bound, tolerating
small-value noise. When the baseline is a placeholder, the comparator
emits the captured JSON in the exact schema for one-time paste into
baselines/baseline-timings.json, and the CI Step Summary surfaces that
block for the reviewer.
Scope: gate only — automated baseline persistence, benchmark.sh
PromQL migration, and the historical trend dashboard remain follow-ups.