Phase 10's workload validation configs (expected_metrics.json,
regression-metrics.json, validate_telemetry.py) queried the
MetricsRegistry metrics under the rippled_ prefix, but MetricsRegistry
emits them as xrpld_ (see MetricsRegistry.cpp). On a live run the
workload validator reported every MetricsRegistry metric as missing,
masking genuine regressions.
Rename the following metrics from the rippled_ prefix to xrpld_ across
the workload validator, expected-metrics manifest, and
regression-metrics template:
- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
object_count
- rpc_method_started_total / _finished_total / _errored_total /
_duration_us
- job_queued_total / _started_total / _finished_total /
_queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
db_metrics, complete_ledgers, build_info, state_tracking,
storage_detail
- ledgers_closed_total, validations_sent_total,
validations_checked_total, state_changes_total
- validation_agreement, validation_agreements_total,
validation_missed_total
Mirrors the phase-9 fix in commit 5601615952.
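The rename above can be sketched as a prefix rewrite. This is an illustration only: to_registry_name and REGISTRY_METRICS are hypothetical names, the set holds just a few of the base names listed above, and it assumes metrics outside that list keep their existing prefix.

```python
OLD_PREFIX = "rippled_"
NEW_PREFIX = "xrpld_"

# A few of the MetricsRegistry base names from the list above (not the full set).
REGISTRY_METRICS = {
    "nodestore_state",
    "rpc_method_started_total",
    "job_queued_total",
    "ledgers_closed_total",
    "validation_agreement",
}


def to_registry_name(name: str) -> str:
    """Rewrite rippled_<base> to xrpld_<base> for MetricsRegistry metrics only."""
    if name.startswith(OLD_PREFIX):
        base = name[len(OLD_PREFIX):]
        if base in REGISTRY_METRICS:
            return NEW_PREFIX + base
    # Assumption: anything not emitted by MetricsRegistry is left untouched.
    return name
```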
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Separate local declarations from assignments to avoid hiding errors,
and use [[ instead of [ for non-POSIX comparisons.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Populate baselines/baseline-timings.json from the green CI run
(24906110133, commit f11ebc1253). 25/31 metrics have non-null values;
the remaining 6 (span.rpc.*) are null due to sparse data in the 3m
window.
Remove the rpc_methods section from regression-metrics.json and its
thresholds. rippled_rpc_method_duration_us_bucket is never populated
because PerfLogImp::rpcEnd never calls MetricsRegistry::recordRpcFinished
— only recordRpcStarted is wired up (Phase 9 instrumentation gap).
The span-based rpc.request/rpc.process metrics via spanmetrics already
cover RPC latency.
Two CI failures traced to root cause:
1. rippled_jobq_job_count: 0 series. StatsDGaugeImpl declared
m_dirty{false} despite the constructor comment saying "start dirty".
Gauges whose value starts and stays at 0 were never emitted, so the
series never appeared in Prometheus. Fix: initialize the member as
m_dirty{true}.
2. TX error rate 82.8% — the submitter tracked account sequences
locally, but in a multi-node consensus network other nodes' txns
advance sequences independently. After a few ledger closes the
locally-tracked sequence fell behind the ledger, producing
tefPAST_SEQ for every subsequent submission. Fix: refresh account
sequences from account_info every 10 s during the submission loop.
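Item 1 can be illustrated with a toy gauge that only emits when its dirty flag is set. This is a Python sketch of the behaviour, not the StatsDGaugeImpl code; the Gauge class and its start_dirty parameter are hypothetical.

```python
class Gauge:
    """Toy gauge that emits only when dirty (not the real StatsDGaugeImpl)."""

    def __init__(self, name, start_dirty=True):
        self.name = name
        self.value = 0
        self.dirty = start_dirty

    def set(self, value):
        # The flag is raised only on a change, so 0 -> 0 leaves it clear.
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self):
        """Return the StatsD line to emit, or None when nothing is pending."""
        if not self.dirty:
            return None
        self.dirty = False
        return f"{self.name}:{self.value}|g"


broken = Gauge("jobq_job_count", start_dirty=False)
broken.set(0)  # value unchanged, so nothing is ever emitted
fixed = Gauge("jobq_job_count", start_dirty=True)
```

With start_dirty=False the first flush returns None, so a gauge pinned at 0 never produces a series. Starting dirty emits the initial 0 once.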
- capture_timings.py: fail when captured/total ratio < 50%
(--min-capture-ratio). Prevents a silent pass when Prometheus is
unreachable.
- run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so
the final exit code reflects it. Update exit code docs in header.
- compare_to_baseline.py: extract _skip_delta helper to bring
compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound
resolution (use explicit None check instead of `or`). Remove dead
variable override_prefix_key.
- prom_queries.py: extract _build_simple_entries and _build_job_entries
to bring build_query_plan under 80 lines. Fix module docstring return
type example. Use aiohttp.ClientTimeout instead of bare int.
- telemetry-validation.yml: add set -euo pipefail to regression summary
step; guard jq calls with -e flag and fallback; fail on missing
baseline file; emit ::warning annotation when timings.json missing.
- baselines/README.md: document the placeholder field.
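The 0.0-as-falsy bug mentioned for compare_to_baseline.py follows a common Python pattern. A minimal sketch, assuming a hypothetical default bound and hypothetical function names (the real helper may differ):

```python
DEFAULT_ABS_BOUND = 5.0  # hypothetical default, for illustration only


def resolve_abs_bound_buggy(override):
    # `or` treats an explicit 0.0 override as missing and falls back.
    return override or DEFAULT_ABS_BOUND


def resolve_abs_bound(override):
    # An explicit None check keeps 0.0 as a valid bound.
    return DEFAULT_ABS_BOUND if override is None else override
```

An override of 0.0 yields 5.0 from the buggy version and 0.0 from the fixed one; None yields the default in both.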
Captures per-span / per-RPC / per-job timings from Prometheus after the
workload run and diffs them against a committed baseline. Regression
requires breaching both a percentage and an absolute bound, tolerating
small-value noise. When the baseline is a placeholder, the comparator
emits the captured JSON in the exact schema for one-time paste into
baselines/baseline-timings.json, and the CI Step Summary surfaces that
block for the reviewer.
Scope: gate only — automated baseline persistence, benchmark.sh
PromQL migration, and the historical trend dashboard remain follow-ups.
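The two-bound rule can be sketched as follows. The function name, parameter names, and the handling of null or zero baselines are assumptions for illustration, not the comparator's actual code.

```python
def is_regression(baseline, current, pct_bound, abs_bound):
    """Flag a regression only when the increase breaches both bounds."""
    if baseline is None or current is None:
        return False  # assumption: null metrics are skipped, not failed
    delta = current - baseline
    if delta <= abs_bound:
        return False  # small absolute change: treated as noise
    if baseline == 0:
        return True  # assumption: percentage is undefined, absolute bound decides
    return (delta / baseline) * 100.0 > pct_bound
```

A metric moving from 2.0 to 4.0 doubles but stays within a 5.0 absolute bound, so it passes. A metric moving from 100.0 to 130.0 breaches both a 20% and a 5.0 bound, so it is flagged.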
Migrate validate_telemetry.py to Tempo TraceQL search API, remove
Jaeger service from workload docker-compose, update readiness checks.
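A TraceQL lookup against Tempo's search endpoint might look like the sketch below. The helper names, base URL, and query are illustrative assumptions and are not taken from validate_telemetry.py.

```python
from urllib.parse import urlencode


def tempo_search_url(base_url, traceql, limit=20):
    """Build a GET /api/search URL carrying a TraceQL query in `q`."""
    query = urlencode({"q": traceql, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{query}"


def trace_ids(response):
    """Extract trace IDs from a decoded search response body."""
    return [trace["traceID"] for trace in response.get("traces", [])]


# Hypothetical local Tempo instance and span name.
url = tempo_search_url("http://localhost:3200/", '{ name = "rpc.request" }', limit=5)
```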
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove duplicate 'system-node-health' UID from expected_metrics.json
(already covered by 'rippled-system-node-health')
- Add parity span attributes to expected_spans.json: node health on
rpc.command.*, validation hash/full on consensus.validation.send,
quorum/proposers on consensus.accept, validation hash/full on
peer.validation.receive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>