rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-24 07:30:30 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	815e2b1f5d	refactor(telemetry): fix remaining old attr refs in tests, docs, workload - Update Telemetry.h doc example: xrpl.rpc.command -> command. - Update SpanGuardFactory.cpp test: use new bare attr names. - Update TESTING.md: rename attr refs in span table + PromQL example. - Update expected_spans.json: all attrs match simplified naming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:21:18 +01:00
Pratik Mankawde	592e546f82	fix(telemetry): align Phase 10 workload configs with xrpld_ metric prefix Phase 10's workload validation configs (expected_metrics.json, regression-metrics.json, validate_telemetry.py) queried the MetricsRegistry metrics under the rippled_ prefix, but MetricsRegistry emits them as xrpld_ (see MetricsRegistry.cpp). On a live run the workload validator reported every MetricsRegistry metric as missing, masking genuine regressions. Rename the following to xrpld_ across the workload validator, expected-metrics manifest, and regression-metrics template: - nodestore_state, cache_metrics, txq_metrics, load_factor_metrics, object_count - rpc_method_started_total / _finished_total / _errored_total / _duration_us - job_queued_total / _started_total / _finished_total / _queued_duration_us_bucket / _running_duration_us_bucket - peer_quality, server_info, validator_health, ledger_economy, db_metrics, complete_ledgers, build_info, state_tracking, storage_detail - ledgers_closed_total, validations_sent_total, validations_checked_total, state_changes_total - validation_agreement, validation_agreements_total, validation_missed_total Mirrors the phase-9 fix in commit `5601615952`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-13 15:01:13 +01:00
Pratik Mankawde	8e44c95d6a	fix: address bashate warnings in benchmark.sh (E042/E044) Separate local declarations from assignments to avoid hiding errors, and use [[ instead of [ for non-POSIX comparisons. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:42:01 +01:00
Pratik Mankawde	b659d43395	fix: address CI rename checks (rippled -> xrpld) in phase-10 docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:40:44 +01:00
Pratik Mankawde	dc13e9d680	fix: populate baseline from CI run, remove dead rpc_methods metrics Populate baselines/baseline-timings.json from the green CI run (24906110133, commit `f11ebc1253`). 25/31 metrics have non-null values; 6 span.rpc.* are null due to sparse data in the 3m window. Remove the rpc_methods section from regression-metrics.json and its thresholds. rippled_rpc_method_duration_us_bucket is never populated because PerfLogImp::rpcEnd never calls MetricsRegistry::recordRpcFinished — only recordRpcStarted is wired up (Phase 9 instrumentation gap). The span-based rpc.request/rpc.process metrics via spanmetrics already cover RPC latency.	2026-04-24 20:08:52 +01:00
Pratik Mankawde	f11ebc1253	fix: StatsDGauge dirty init + tx_submitter sequence drift in CI Two CI failures traced to root cause: 1. rippled_jobq_job_count: 0 series — StatsDGaugeImpl declared m_dirty{false} despite the constructor comment saying "start dirty". Gauges whose value starts and stays at 0 never emitted, so Prometheus never scraped them. Fix: m_dirty{true} on the member initializer. 2. TX error rate 82.8% — the submitter tracked account sequences locally, but in a multi-node consensus network other nodes' txns advance sequences independently. After a few ledger closes the locally-tracked sequence fell behind the ledger, producing tefPAST_SEQ for every subsequent submission. Fix: refresh account sequences from account_info every 10 s during the submission loop.	2026-04-24 19:42:20 +01:00
Pratik Mankawde	577d1f8a21	fix: address review findings in regression gate - capture_timings.py: fail when captured/total ratio < 50% (--min-capture-ratio). Prevents silent pass on unreachable Prometheus. - run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so the final exit code reflects it. Update exit code docs in header. - compare_to_baseline.py: extract _skip_delta helper to bring compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound resolution (use explicit None check instead of `or`). Remove dead variable override_prefix_key. - prom_queries.py: extract _build_simple_entries and _build_job_entries to bring build_query_plan under 80 lines. Fix module docstring return type example. Use aiohttp.ClientTimeout instead of bare int. - telemetry-validation.yml: add set -euo pipefail to regression summary step; guard jq calls with -e flag and fallback; fail on missing baseline file; emit ::warning annotation when timings.json missing. - baselines/README.md: document the placeholder field.	2026-04-24 19:36:15 +01:00
Pratik Mankawde	df79d5e74b	feat: add OTel-driven regression gate for Phase 10 telemetry validation Captures per-span / per-RPC / per-job timings from Prometheus after the workload run and diffs them against a committed baseline. Regression requires breaching both a percentage and an absolute bound, tolerating small-value noise. When the baseline is a placeholder, the comparator emits the captured JSON in the exact schema for one-time paste into baselines/baseline-timings.json, and the CI Step Summary surfaces that block for the reviewer. Scope: gate only — automated baseline persistence, benchmark.sh PromQL migration, and the historical trend dashboard remain follow-ups.	2026-04-24 18:53:44 +01:00
Pratik Mankawde	a142a700e8	refactor(telemetry): migrate Phase 10 validation from Jaeger to Tempo native API Migrate validate_telemetry.py to Tempo TraceQL search API, remove Jaeger service from workload docker-compose, update readiness checks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	ff1502f939	feat(telemetry): add workload orchestrator with phased load profiles Add a profile-driven workload orchestrator that executes sequential load phases with configurable RPC rates and TX throughput. Three profiles: full-validation (6 phases covering all 18 dashboards), quick-smoke (CI), and stress (benchmarking). Fix 10 validation failures: correct Phase 9 metric prefixes, relax peer latency bounds for localhost clusters, and allow sub-microsecond span durations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	e63ca4c495	fix(telemetry): fix dashboard UID and add parity attributes to expected_spans - Remove duplicate 'system-node-health' UID from expected_metrics.json (already covered by 'rippled-system-node-health') - Add parity span attributes to expected_spans.json: node health on rpc.command.*, validation hash/full on consensus.validation.send, quorum/proposers on consensus.accept, validation hash/full on peer.validation.receive Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	711ae43174	feat(telemetry): add external dashboard parity validation checks (Task 10.8) Add ~28 validation checks for external dashboard parity: - 8 span attribute checks (server_info, tx.receive, consensus, peer spans) - 13 metric existence checks (validation agreement, validator health, peer quality, ledger economy, state tracking, counters, storage) - 3 dashboard load checks (validator-health, peer-quality, system-node-health) - 4 value sanity checks (agreement %, UNL expiry, latency, state value) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	5de8c520d1	Phase 10: Workload validation - synthetic load generation and telemetry checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00

13 Commits
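
Note on the regression gate (commits `df79d5e74b` and `577d1f8a21`): the messages above say a regression must breach both a percentage and an absolute bound, and that abs_bound resolution was switched from `or` to an explicit None check. The sketch below illustrates those two points only. It is not the code in compare_to_baseline.py: the function names, the default thresholds, and the handling of null values are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical defaults; the real thresholds are defined in the repository.
DEFAULT_PCT_BOUND = 20.0   # percent increase over baseline
DEFAULT_ABS_BOUND = 500.0  # same unit as the timing (e.g. microseconds)


def resolve_abs_bound(override: Optional[float],
                      default: float = DEFAULT_ABS_BOUND) -> float:
    """Return the per-metric override if one is set, else the default.

    The explicit None check keeps a legitimate override of 0.0.
    Writing `override or default` would discard it, because 0.0 is falsy.
    """
    return default if override is None else override


def is_regression(baseline: Optional[float],
                  current: Optional[float],
                  pct_bound: float = DEFAULT_PCT_BOUND,
                  abs_override: Optional[float] = None) -> bool:
    """Flag a regression only when BOTH bounds are breached."""
    # Assumption: a null baseline or capture is skipped, not failed.
    if baseline is None or current is None:
        return False

    delta = current - baseline
    if delta <= 0:
        return False  # same or faster than baseline

    abs_bound = resolve_abs_bound(abs_override)
    # A zero baseline makes any increase an unbounded percentage change.
    pct = (delta / baseline * 100.0) if baseline > 0 else float("inf")

    return pct > pct_bound and delta > abs_bound
```

With these assumed defaults, a timing that doubles from 10 to 20 is not flagged, because the 10-unit increase stays under the absolute bound. A move from 10000 to 13000 is flagged, because it exceeds both bounds. Passing abs_override=0.0 disables the absolute tolerance for that metric, which is the case the `or` form would have dropped.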