Commit Graph

13945 Commits

Author SHA1 Message Date
Pratik Mankawde
577d1f8a21 fix: address review findings in regression gate
- capture_timings.py: fail when captured/total ratio < 50%
  (--min-capture-ratio). Prevents silent pass on unreachable Prometheus.
- run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so
  the final exit code reflects it. Update exit code docs in header.
- compare_to_baseline.py: extract _skip_delta helper to bring
  compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound
  resolution (use explicit None check instead of `or`). Remove dead
  variable override_prefix_key.
- prom_queries.py: extract _build_simple_entries and _build_job_entries
  to bring build_query_plan under 80 lines. Fix module docstring return
  type example. Use aiohttp.ClientTimeout instead of bare int.
- telemetry-validation.yml: add set -euo pipefail to regression summary
  step; guard jq calls with -e flag and fallback; fail on missing
  baseline file; emit ::warning annotation when timings.json missing.
- baselines/README.md: document the placeholder field.
2026-04-24 19:36:15 +01:00
Pratik Mankawde
df79d5e74b feat: add OTel-driven regression gate for Phase 10 telemetry validation
Captures per-span / per-RPC / per-job timings from Prometheus after the
workload run and diffs them against a committed baseline. Regression
requires breaching both a percentage and an absolute bound, tolerating
small-value noise. When the baseline is a placeholder, the comparator
emits the captured JSON in the exact schema for one-time paste into
baselines/baseline-timings.json, and the CI Step Summary surfaces that
block for the reviewer.

Scope: gate only — automated baseline persistence, benchmark.sh
PromQL migration, and the historical trend dashboard remain follow-ups.
2026-04-24 18:53:44 +01:00
Pratik Mankawde
8583343fd9 fix(telemetry): restore Loki, StatsD, filelog configs lost in rebase
The Jaeger-removal rebase used --ours conflict resolution which
dropped content added by intermediate phases (6-8): StatsD receiver,
filelog receiver, Loki service/exporter, health_check extension,
and OTLP metrics pipeline. Restore from pre-rebase origin minus
Jaeger references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 11:02:03 +01:00
Pratik Mankawde
7e149f7773 refactor(telemetry): remove residual Jaeger references across chain
Fix remaining Jaeger references that accumulated across intermediate
branches in the stacked PR chain. These were in files modified by
multiple phases where the per-branch fixes didn't cover all additions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:35:04 +01:00
Pratik Mankawde
a142a700e8 refactor(telemetry): migrate Phase 10 validation from Jaeger to Tempo native API
Migrate validate_telemetry.py to Tempo TraceQL search API, remove
Jaeger service from workload docker-compose, update readiness checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
a0eeb8eb9e fix(telemetry): fix Windows WinSock.h header ordering in MetricsRegistry test
Pre-include boost/asio/detail/socket_types.hpp on Windows before OTel
SDK headers to ensure WinSock2.h is included before WinSock.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
f4d327fda7 fix(telemetry): fix CI linker errors, check-rename, and docs build
- Add ValidationTracker.cpp to xrpl.test.telemetry target sources
  (implementation lives in src/xrpld/ but has no OTel SDK dependency)
- Change BEAST_DEFINE_TESTSUITE namespace from ripple to xrpl
- Replace recursive *.md glob with non-recursive GLOB in XrplDocs.cmake
  to avoid picking up .claude/instructions.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
ff1502f939 feat(telemetry): add workload orchestrator with phased load profiles
Add a profile-driven workload orchestrator that executes sequential load
phases with configurable RPC rates and TX throughput. Three profiles:
full-validation (6 phases covering all 18 dashboards), quick-smoke (CI),
and stress (benchmarking). Fix 10 validation failures: correct Phase 9
metric prefixes, relax peer latency bounds for localhost clusters, and
allow sub-microsecond span durations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
ecf103104b fix(telemetry): fix CI failures in MetricsRegistry test, levelization, and dashboard titles
- Update MockServiceRegistry to match current ServiceRegistry interface
  (17 method renames: get* prefix, PathRequests→PathRequestManager)
- Make throwUnimplemented() static to satisfy clang-tidy
- Regenerate levelization ordering.txt and loops.txt
- Remove 'rippled' prefix from 3 StatsD dashboard titles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
e1f30c1a22 docs: update data-collection-reference and presentation for external dashboard parity
- Fix validations_checked_total recording site (NetworkOPs, not LedgerMaster)
- Add Slide 11 to presentation: External Dashboard Parity overview with
  Mermaid diagrams for new metric categories, ValidationTracker sequence,
  and new dashboard summary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
e63ca4c495 fix(telemetry): fix dashboard UID and add parity attributes to expected_spans
- Remove duplicate 'system-node-health' UID from expected_metrics.json
  (already covered by 'rippled-system-node-health')
- Add parity span attributes to expected_spans.json: node health on
  rpc.command.*, validation hash/full on consensus.validation.send,
  quorum/proposers on consensus.accept, validation hash/full on
  peer.validation.receive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
711ae43174 feat(telemetry): add external dashboard parity validation checks (Task 10.8)
Add ~28 validation checks for external dashboard parity:
- 8 span attribute checks (server_info, tx.receive, consensus, peer spans)
- 13 metric existence checks (validation agreement, validator health,
  peer quality, ledger economy, state tracking, counters, storage)
- 3 dashboard load checks (validator-health, peer-quality, system-node-health)
- 4 value sanity checks (agreement %, UNL expiry, latency, state value)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
898d05de66 docs: add Tasks 11.12-11.13 for external dashboard parity alerts and docs
Task 11.12: 18 Grafana alert rules (critical/network/performance groups)
for Phase 7+ parity metrics — validation agreement, state tracking,
validator health, peer quality, ledger economy.

Task 11.13: Dual-datasource architecture documentation — records the
external dashboard's fast-path pattern as a future optimization option.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
5de8c520d1 Phase 10: Workload validation - synthetic load generation and telemetry checks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
0644438549 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
d8c586b2fb Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
8cca4ec77b Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
38fca631cd docs(telemetry): replace Jaeger references in Phase 10 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
5f139e12c3 feat(telemetry): add 7-day agreement window to validation_agreement gauge
Add agreement_pct_7d, agreements_7d, missed_7d labels to the
rippled_validation_agreement observable gauge, matching the external
xrpl-validator-dashboard's 7-day tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
1defb2111f fix(telemetry): fix ServiceRegistry API names and transaction rate computation
- cachedSLEs() -> getCachedSLEs()
- openLedger() -> getOpenLedger()
- overlay() -> getOverlay()
- Use OpenView::txCount() for transaction rate instead of SHAMap::size()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
350e398aa6 feat(telemetry): wire ValidationTracker to MetricsRegistry and consensus hooks
Add ValidationTracker member to MetricsRegistry with a public accessor,
register a rippled_validation_agreement observable gauge that calls
reconcile() and reports 1h/24h agreement percentages and counts, and
hook recordOurValidation/recordNetworkValidation into RCLConsensus
validate() and LedgerMaster setValidLedger() respectively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
92607805c3 feat(telemetry): add validationsChecked recording hook in recvValidation
Wire incrementValidationsChecked() into NetworkOPs::recvValidation() so
each received network validation increments the counter.

Note: incrementJqTransOverflow() hook is deferred — JobQueue has no
explicit overflow event path; the counter is reserved for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
45ffe8e2ec fix(telemetry): add missing counters, fix dashboard metric name, clean dead code
- Add rippled_validation_agreements_total and rippled_validation_missed_total
  counter declarations and creation (wiring to ValidationTracker pending rebase)
- Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"}
  instead of non-existent rippled_Overlay_Peer_Disconnects_Charges
- Remove dead getCountsJson() call in storageDetail callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b0e0d5930a fix(telemetry): fix metric labels and add missing parity gauge values
- Rename fee labels to match spec: base_fee_drops -> base_fee_xrp,
  reserve_base_drops -> reserve_base_xrp, reserve_inc_drops -> reserve_inc_xrp
- Add peers_insane_count (stub with TODO for PeerImp::tracking_ exposure)
- Add transaction_rate to ledger economy gauge
- Replace node_store_writes/node_written_bytes with nudb_bytes per spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
50e6b14c56 feat(telemetry): add external dashboard parity gauges and counters to MetricsRegistry
Add validator health, peer quality, ledger economy, state tracking, and
storage detail observable gauges plus 5 synchronous counters with recording
hooks for ledger close, validation send, state change, and overflow events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b92354715d feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
81298ceb9f docs: add external dashboard parity tasks and metric reference for Phase 9
Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards),
new metric tables in data-collection-reference, and monitoring sections in runbook
covering validation agreement, validator health, peer quality, and state tracking.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
936c73982d docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
d426f4983a feat(telemetry): add push_metrics.py parity gauges to MetricsRegistry
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
892fee638a Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
facc111c22 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
5ec9f3f30a Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
8f364ed6f4 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
30c430aec8 docs(telemetry): replace Jaeger references in Phase 8 docs and runbook
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
fdec3ce5c4 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
aa062ecdbe Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
0e15f95543 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
eca887c66e feat(telemetry): add 7-day validation agreement window to ValidationTracker
Add window7d_ deque, agreementPct7d(), agreements7d(), missed7d() to
match the external xrpl-validator-dashboard's 7-day agreement tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
f51976f63e test(telemetry): add ValidationTracker unit tests
Cover normal agreement, missed validation, late repair, empty window,
grace period boundary, max pending trimming, mixed results, duplicate
recording, and only-we-validated scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
1f2a36b316 fix(telemetry): fix ValidationTracker grace period boundary and hard trim
- Use >= instead of > for grace period comparison to reconcile at exactly
  8 seconds rather than skipping the boundary
- Two-pass hard trim: first remove entries past late-repair window, then
  any reconciled entry, to avoid sabotaging late repairs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
8365f7dda3 feat(telemetry): add ValidationTracker for validation agreement tracking (Task 7.8)
Standalone class that tracks whether this validator's validations agree
with network consensus, maintaining rolling 1h/24h windows and lifetime
totals with a late-repair mechanism for out-of-order arrivals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
391b8f91ce docs: add Tasks 7.9-7.16 for external dashboard parity metrics
Adds ValidationTracker (agreement computation with 8s grace period),
validator health, peer quality, ledger economy, state tracking,
storage detail gauges, 7 synchronous counters, and agreement gauge.

29 new metrics covering validation agreement, peer quality, UNL health,
ledger economy, state tracking, and upgrade awareness.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
2f7064ace6 Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
1ef234de9d docs(telemetry): replace Jaeger with Tempo in data collection reference
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
a37cf74868 docs: add peerDisconnectsCharges metric to data collection reference
Bridge the existing beast::insight gauge for resource-limit peer
disconnects (peerDisconnectsCharges_) into the StatsD metric inventory.

Part of the external dashboard parity initiative.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
21192e9b3f Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
2a2c9dc5dc fix: remove non-existent CanonicalTXSet.h include from BuildLedger.cpp
The xrpld/app/misc/CanonicalTXSet.h header doesn't exist — it was
incorrectly added during a rebase conflict resolution. The correct
include xrpl/ledger/CanonicalTXSet.h is already present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
6723815563 feat(telemetry): add validation attributes to peer.validation.receive span (Task 4.8)
Add ledger hash and full-validation flag to peer.validation.receive
spans for trace-level agreement analysis across validators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
7e5591318f Phase 5b: Ledger, peer, and tx spans with expanded Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
87ed778efe refactor(telemetry): migrate integration test and docs from Jaeger to Tempo API
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00