Two CI failures traced to root cause:
1. rippled_jobq_job_count: 0 series — StatsDGaugeImpl declared
m_dirty{false} despite the constructor comment saying "start dirty".
Gauges whose value starts and stays at 0 never emitted, so Prometheus
never scraped them. Fix: m_dirty{true} on the member initializer.
2. TX error rate 82.8% — the submitter tracked account sequences
locally, but in a multi-node consensus network other nodes' txns
advance sequences independently. After a few ledger closes the
locally-tracked sequence fell behind the ledger, producing
tefPAST_SEQ for every subsequent submission. Fix: refresh account
sequences from account_info every 10 s during the submission loop.
- capture_timings.py: fail when captured/total ratio < 50%
(--min-capture-ratio). Prevents silent pass on unreachable Prometheus.
- run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so
the final exit code reflects it. Update exit code docs in header.
- compare_to_baseline.py: extract _skip_delta helper to bring
compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound
resolution (use explicit None check instead of `or`). Remove dead
variable override_prefix_key.
- prom_queries.py: extract _build_simple_entries and _build_job_entries
to bring build_query_plan under 80 lines. Fix module docstring return
type example. Use aiohttp.ClientTimeout instead of bare int.
- telemetry-validation.yml: add set -euo pipefail to regression summary
step; guard jq calls with -e flag and fallback; fail on missing
baseline file; emit ::warning annotation when timings.json missing.
- baselines/README.md: document the placeholder field.
Captures per-span / per-RPC / per-job timings from Prometheus after the
workload run and diffs them against a committed baseline. Regression
requires breaching both a percentage and an absolute bound, tolerating
small-value noise. When the baseline is a placeholder, the comparator
emits the captured JSON in the exact schema for one-time paste into
baselines/baseline-timings.json, and the CI Step Summary surfaces that
block for the reviewer.
Scope: gate only — automated baseline persistence, benchmark.sh
PromQL migration, and the historical trend dashboard remain follow-ups.
Migrate validate_telemetry.py to Tempo TraceQL search API, remove
Jaeger service from workload docker-compose, update readiness checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove duplicate 'system-node-health' UID from expected_metrics.json
(already covered by 'rippled-system-node-health')
- Add parity span attributes to expected_spans.json: node health on
rpc.command.*, validation hash/full on consensus.validation.send,
quorum/proposers on consensus.accept, validation hash/full on
peer.validation.receive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>