rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-30 18:40:28 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	577d1f8a21	fix: address review findings in regression gate - capture_timings.py: fail when captured/total ratio < 50% (--min-capture-ratio). Prevents silent pass on unreachable Prometheus. - run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so the final exit code reflects it. Update exit code docs in header. - compare_to_baseline.py: extract _skip_delta helper to bring compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound resolution (use explicit None check instead of `or`). Remove dead variable override_prefix_key. - prom_queries.py: extract _build_simple_entries and _build_job_entries to bring build_query_plan under 80 lines. Fix module docstring return type example. Use aiohttp.ClientTimeout instead of bare int. - telemetry-validation.yml: add set -euo pipefail to regression summary step; guard jq calls with -e flag and fallback; fail on missing baseline file; emit ::warning annotation when timings.json missing. - baselines/README.md: document the placeholder field.	2026-04-24 19:36:15 +01:00
Pratik Mankawde	df79d5e74b	feat: add OTel-driven regression gate for Phase 10 telemetry validation Captures per-span / per-RPC / per-job timings from Prometheus after the workload run and diffs them against a committed baseline. Regression requires breaching both a percentage and an absolute bound, tolerating small-value noise. When the baseline is a placeholder, the comparator emits the captured JSON in the exact schema for one-time paste into baselines/baseline-timings.json, and the CI Step Summary surfaces that block for the reviewer. Scope: gate only — automated baseline persistence, benchmark.sh PromQL migration, and the historical trend dashboard remain follow-ups.	2026-04-24 18:53:44 +01:00
Pratik Mankawde	8583343fd9	fix(telemetry): restore Loki, StatsD, filelog configs lost in rebase The Jaeger-removal rebase used --ours conflict resolution which dropped content added by intermediate phases (6-8): StatsD receiver, filelog receiver, Loki service/exporter, health_check extension, and OTLP metrics pipeline. Restore from pre-rebase origin minus Jaeger references. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 11:02:03 +01:00
Pratik Mankawde	7e149f7773	refactor(telemetry): remove residual Jaeger references across chain Fix remaining Jaeger references that accumulated across intermediate branches in the stacked PR chain. These were in files modified by multiple phases where the per-branch fixes didn't cover all additions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:35:04 +01:00
Pratik Mankawde	a142a700e8	refactor(telemetry): migrate Phase 10 validation from Jaeger to Tempo native API Migrate validate_telemetry.py to Tempo TraceQL search API, remove Jaeger service from workload docker-compose, update readiness checks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	a0eeb8eb9e	fix(telemetry): fix Windows WinSock.h header ordering in MetricsRegistry test Pre-include boost/asio/detail/socket_types.hpp on Windows before OTel SDK headers to ensure WinSock2.h is included before WinSock.h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	f4d327fda7	fix(telemetry): fix CI linker errors, check-rename, and docs build - Add ValidationTracker.cpp to xrpl.test.telemetry target sources (implementation lives in src/xrpld/ but has no OTel SDK dependency) - Change BEAST_DEFINE_TESTSUITE namespace from ripple to xrpl - Replace recursive *.md glob with non-recursive GLOB in XrplDocs.cmake to avoid picking up .claude/instructions.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	ff1502f939	feat(telemetry): add workload orchestrator with phased load profiles Add a profile-driven workload orchestrator that executes sequential load phases with configurable RPC rates and TX throughput. Three profiles: full-validation (6 phases covering all 18 dashboards), quick-smoke (CI), and stress (benchmarking). Fix 10 validation failures: correct Phase 9 metric prefixes, relax peer latency bounds for localhost clusters, and allow sub-microsecond span durations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	ecf103104b	fix(telemetry): fix CI failures in MetricsRegistry test, levelization, and dashboard titles - Update MockServiceRegistry to match current ServiceRegistry interface (17 method renames: get* prefix, PathRequests→PathRequestManager) - Make throwUnimplemented() static to satisfy clang-tidy - Regenerate levelization ordering.txt and loops.txt - Remove 'rippled' prefix from 3 StatsD dashboard titles Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	e1f30c1a22	docs: update data-collection-reference and presentation for external dashboard parity - Fix validations_checked_total recording site (NetworkOPs, not LedgerMaster) - Add Slide 11 to presentation: External Dashboard Parity overview with Mermaid diagrams for new metric categories, ValidationTracker sequence, and new dashboard summary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	e63ca4c495	fix(telemetry): fix dashboard UID and add parity attributes to expected_spans - Remove duplicate 'system-node-health' UID from expected_metrics.json (already covered by 'rippled-system-node-health') - Add parity span attributes to expected_spans.json: node health on rpc.command.*, validation hash/full on consensus.validation.send, quorum/proposers on consensus.accept, validation hash/full on peer.validation.receive Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	711ae43174	feat(telemetry): add external dashboard parity validation checks (Task 10.8) Add ~28 validation checks for external dashboard parity: - 8 span attribute checks (server_info, tx.receive, consensus, peer spans) - 13 metric existence checks (validation agreement, validator health, peer quality, ledger economy, state tracking, counters, storage) - 3 dashboard load checks (validator-health, peer-quality, system-node-health) - 4 value sanity checks (agreement %, UNL expiry, latency, state value) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	898d05de66	docs: add Tasks 11.12-11.13 for external dashboard parity alerts and docs Task 11.12: 18 Grafana alert rules (critical/network/performance groups) for Phase 7+ parity metrics — validation agreement, state tracking, validator health, peer quality, ledger economy. Task 11.13: Dual-datasource architecture documentation — records the external dashboard's fast-path pattern as a future optimization option. Source: external dashboard parity design spec (2026-03-30). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	5de8c520d1	Phase 10: Workload validation - synthetic load generation and telemetry checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	0644438549	Phase 8: Log-trace correlation with Loki and filelog receiver Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:01 +01:00
Pratik Mankawde	d8c586b2fb	Phase 7: Native OTel metrics migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:01 +01:00
Pratik Mankawde	8cca4ec77b	Phase 6: StatsD metrics integration into telemetry pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:01 +01:00
Pratik Mankawde	38fca631cd	docs(telemetry): replace Jaeger references in Phase 10 task list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	5f139e12c3	feat(telemetry): add 7-day agreement window to validation_agreement gauge Add agreement_pct_7d, agreements_7d, missed_7d labels to the rippled_validation_agreement observable gauge, matching the external xrpl-validator-dashboard's 7-day tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	1defb2111f	fix(telemetry): fix ServiceRegistry API names and transaction rate computation - cachedSLEs() -> getCachedSLEs() - openLedger() -> getOpenLedger() - overlay() -> getOverlay() - Use OpenView::txCount() for transaction rate instead of SHAMap::size() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	350e398aa6	feat(telemetry): wire ValidationTracker to MetricsRegistry and consensus hooks Add ValidationTracker member to MetricsRegistry with a public accessor, register a rippled_validation_agreement observable gauge that calls reconcile() and reports 1h/24h agreement percentages and counts, and hook recordOurValidation/recordNetworkValidation into RCLConsensus validate() and LedgerMaster setValidLedger() respectively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	92607805c3	feat(telemetry): add validationsChecked recording hook in recvValidation Wire incrementValidationsChecked() into NetworkOPs::recvValidation() so each received network validation increments the counter. Note: incrementJqTransOverflow() hook is deferred — JobQueue has no explicit overflow event path; the counter is reserved for future use. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	45ffe8e2ec	fix(telemetry): add missing counters, fix dashboard metric name, clean dead code - Add rippled_validation_agreements_total and rippled_validation_missed_total counter declarations and creation (wiring to ValidationTracker pending rebase) - Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"} instead of non-existent rippled_Overlay_Peer_Disconnects_Charges - Remove dead getCountsJson() call in storageDetail callback Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	b0e0d5930a	fix(telemetry): fix metric labels and add missing parity gauge values - Rename fee labels to match spec: base_fee_drops -> base_fee_xrp, reserve_base_drops -> reserve_base_xrp, reserve_inc_drops -> reserve_inc_xrp - Add peers_insane_count (stub with TODO for PeerImp::tracking_ exposure) - Add transaction_rate to ledger economy gauge - Replace node_store_writes/node_written_bytes with nudb_bytes per spec Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	50e6b14c56	feat(telemetry): add external dashboard parity gauges and counters to MetricsRegistry Add validator health, peer quality, ledger economy, state tracking, and storage detail observable gauges plus 5 synchronous counters with recording hooks for ledger close, validation send, state change, and overflow events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	b92354715d	feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	81298ceb9f	docs: add external dashboard parity tasks and metric reference for Phase 9 Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards), new metric tables in data-collection-reference, and monitoring sections in runbook covering validation agreement, validator health, peer quality, and state tracking. Source: external dashboard parity design spec (2026-03-30). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	936c73982d	docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges - Add Task 9.7a to Phase9_taskList.md documenting new gauges - Add metric tables to 09-data-collection-reference.md (server_info, build_info, complete_ledgers, db_metrics, extended cache/nodestore) - Update metric counts from ~50 to ~68 in 06-implementation-phases.md - Add OTel MetricsRegistry gauge reference to telemetry-runbook.md - Add 11 new panels to system-node-health.json Grafana dashboard (server state, uptime, peers, validated seq, last close info, build version, complete ledgers, db sizes, historical fetch rate, peer disconnects) - Fix leftover merge conflict marker in 08-appendix.md - Add ripplex/mseconds to cspell dictionary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	d426f4983a	feat(telemetry): add push_metrics.py parity gauges to MetricsRegistry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	892fee638a	Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	facc111c22	Phase 8: Log-trace correlation with Loki and filelog receiver Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	5ec9f3f30a	Phase 7: Native OTel metrics migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	8f364ed6f4	Phase 6: StatsD metrics integration into telemetry pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	30c430aec8	docs(telemetry): replace Jaeger references in Phase 8 docs and runbook Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:37 +01:00
Pratik Mankawde	fdec3ce5c4	Phase 8: Log-trace correlation with Loki and filelog receiver Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:37 +01:00
Pratik Mankawde	aa062ecdbe	Phase 7: Native OTel metrics migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:37 +01:00
Pratik Mankawde	0e15f95543	Phase 6: StatsD metrics integration into telemetry pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:37 +01:00
Pratik Mankawde	eca887c66e	feat(telemetry): add 7-day validation agreement window to ValidationTracker Add window7d_ deque, agreementPct7d(), agreements7d(), missed7d() to match the external xrpl-validator-dashboard's 7-day agreement tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	f51976f63e	test(telemetry): add ValidationTracker unit tests Cover normal agreement, missed validation, late repair, empty window, grace period boundary, max pending trimming, mixed results, duplicate recording, and only-we-validated scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	1f2a36b316	fix(telemetry): fix ValidationTracker grace period boundary and hard trim - Use >= instead of > for grace period comparison to reconcile at exactly 8 seconds rather than skipping the boundary - Two-pass hard trim: first remove entries past late-repair window, then any reconciled entry, to avoid sabotaging late repairs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	8365f7dda3	feat(telemetry): add ValidationTracker for validation agreement tracking (Task 7.8) Standalone class that tracks whether this validator's validations agree with network consensus, maintaining rolling 1h/24h windows and lifetime totals with a late-repair mechanism for out-of-order arrivals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	391b8f91ce	docs: add Tasks 7.9-7.16 for external dashboard parity metrics Adds ValidationTracker (agreement computation with 8s grace period), validator health, peer quality, ledger economy, state tracking, storage detail gauges, 7 synchronous counters, and agreement gauge. 29 new metrics covering validation agreement, peer quality, UNL health, ledger economy, state tracking, and upgrade awareness. Part of the external dashboard parity initiative across phases 2-11. See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	2f7064ace6	Phase 7: Native OTel metrics migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	1ef234de9d	docs(telemetry): replace Jaeger with Tempo in data collection reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00
Pratik Mankawde	a37cf74868	docs: add peerDisconnectsCharges metric to data collection reference Bridge the existing beast::insight gauge for resource-limit peer disconnects (peerDisconnectsCharges_) into the StatsD metric inventory. Part of the external dashboard parity initiative. See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00
Pratik Mankawde	21192e9b3f	Phase 6: StatsD metrics integration into telemetry pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00
Pratik Mankawde	2a2c9dc5dc	fix: remove non-existent CanonicalTXSet.h include from BuildLedger.cpp The xrpld/app/misc/CanonicalTXSet.h header doesn't exist — it was incorrectly added during a rebase conflict resolution. The correct include xrpl/ledger/CanonicalTXSet.h is already present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:30:59 +01:00
Pratik Mankawde	6723815563	feat(telemetry): add validation attributes to peer.validation.receive span (Task 4.8) Add ledger hash and full-validation flag to peer.validation.receive spans for trace-level agreement analysis across validators. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:30:59 +01:00
Pratik Mankawde	7e5591318f	Phase 5b: Ledger, peer, and tx spans with expanded Grafana dashboards Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:30:59 +01:00
Pratik Mankawde	87ed778efe	refactor(telemetry): migrate integration test and docs from Jaeger to Tempo API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:29:30 +01:00
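
The regression rule described in the df79d5e74b and 577d1f8a21 messages can be sketched as below: a regression must breach both a percentage bound and an absolute bound, and the absolute bound is resolved with an explicit None check rather than `or`. This is only an illustrative sketch, not the code in compare_to_baseline.py. The function names, parameters, units, and default thresholds are all hypothetical; the commit log does not state them.

```python
from typing import Optional

# Hypothetical defaults; the real thresholds are not given in the commit log.
DEFAULT_PCT_BOUND = 20.0  # percent slowdown
DEFAULT_ABS_BOUND = 5.0   # absolute slowdown, assumed to be milliseconds


def resolve_abs_bound(override: Optional[float],
                      default: float = DEFAULT_ABS_BOUND) -> float:
    """Return the override when one was supplied, even if it is 0.0.

    `override or default` would silently discard an explicit 0.0,
    because 0.0 is falsy in Python.
    """
    return default if override is None else override


def is_regression(baseline_ms: float,
                  current_ms: float,
                  pct_bound: float = DEFAULT_PCT_BOUND,
                  abs_bound: Optional[float] = None) -> bool:
    """Flag a regression only when BOTH bounds are breached."""
    abs_limit = resolve_abs_bound(abs_bound)
    delta = current_ms - baseline_ms
    if delta <= 0:
        return False  # same speed or faster is never a regression
    # A zero baseline makes the percentage undefined; treat it as unbounded.
    pct = float("inf") if baseline_ms <= 0 else 100.0 * delta / baseline_ms
    return pct > pct_bound and delta > abs_limit
```

With these assumed defaults, a span that goes from 1.0 to 2.0 is a 100% increase but only 1.0 in absolute terms, so it is treated as small-value noise. Passing `abs_bound=0.0` disables that tolerance, which only works because the override is checked against None.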


13945 Commits