Commit Graph

13930 Commits

Author SHA1 Message Date
Pratik Mankawde
d8c586b2fb Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
8cca4ec77b Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
38fca631cd docs(telemetry): replace Jaeger references in Phase 10 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
5f139e12c3 feat(telemetry): add 7-day agreement window to validation_agreement gauge
Add agreement_pct_7d, agreements_7d, missed_7d labels to the
rippled_validation_agreement observable gauge, matching the external
xrpl-validator-dashboard's 7-day tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
1defb2111f fix(telemetry): fix ServiceRegistry API names and transaction rate computation
- cachedSLEs() -> getCachedSLEs()
- openLedger() -> getOpenLedger()
- overlay() -> getOverlay()
- Use OpenView::txCount() for transaction rate instead of SHAMap::size()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
350e398aa6 feat(telemetry): wire ValidationTracker to MetricsRegistry and consensus hooks
Add ValidationTracker member to MetricsRegistry with a public accessor,
register a rippled_validation_agreement observable gauge that calls
reconcile() and reports 1h/24h agreement percentages and counts, and
hook recordOurValidation/recordNetworkValidation into RCLConsensus
validate() and LedgerMaster setValidLedger() respectively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
92607805c3 feat(telemetry): add validationsChecked recording hook in recvValidation
Wire incrementValidationsChecked() into NetworkOPs::recvValidation() so
each received network validation increments the counter.

Note: incrementJqTransOverflow() hook is deferred — JobQueue has no
explicit overflow event path; the counter is reserved for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
45ffe8e2ec fix(telemetry): add missing counters, fix dashboard metric name, clean dead code
- Add rippled_validation_agreements_total and rippled_validation_missed_total
  counter declarations and creation (wiring to ValidationTracker pending rebase)
- Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"}
  instead of non-existent rippled_Overlay_Peer_Disconnects_Charges
- Remove dead getCountsJson() call in storageDetail callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b0e0d5930a fix(telemetry): fix metric labels and add missing parity gauge values
- Rename fee labels to match spec: base_fee_drops -> base_fee_xrp,
  reserve_base_drops -> reserve_base_xrp, reserve_inc_drops -> reserve_inc_xrp
- Add peers_insane_count (stub with TODO for PeerImp::tracking_ exposure)
- Add transaction_rate to ledger economy gauge
- Replace node_store_writes/node_written_bytes with nudb_bytes per spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
50e6b14c56 feat(telemetry): add external dashboard parity gauges and counters to MetricsRegistry
Add validator health, peer quality, ledger economy, state tracking, and
storage detail observable gauges plus 5 synchronous counters with recording
hooks for ledger close, validation send, state change, and overflow events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b92354715d feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
81298ceb9f docs: add external dashboard parity tasks and metric reference for Phase 9
Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards),
new metric tables in data-collection-reference, and monitoring sections in runbook
covering validation agreement, validator health, peer quality, and state tracking.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
936c73982d docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
d426f4983a feat(telemetry): add push_metrics.py parity gauges to MetricsRegistry
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
892fee638a Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
facc111c22 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
5ec9f3f30a Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
8f364ed6f4 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
30c430aec8 docs(telemetry): replace Jaeger references in Phase 8 docs and runbook
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
fdec3ce5c4 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
aa062ecdbe Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
0e15f95543 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
eca887c66e feat(telemetry): add 7-day validation agreement window to ValidationTracker
Add window7d_ deque, agreementPct7d(), agreements7d(), missed7d() to
match the external xrpl-validator-dashboard's 7-day agreement tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
f51976f63e test(telemetry): add ValidationTracker unit tests
Cover normal agreement, missed validation, late repair, empty window,
grace period boundary, max pending trimming, mixed results, duplicate
recording, and only-we-validated scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
1f2a36b316 fix(telemetry): fix ValidationTracker grace period boundary and hard trim
- Use >= instead of > for grace period comparison to reconcile at exactly
  8 seconds rather than skipping the boundary
- Two-pass hard trim: first remove entries past late-repair window, then
  any reconciled entry, to avoid sabotaging late repairs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
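
The boundary fix described in 1f2a36b316 can be illustrated with a minimal sketch. The names `kGracePeriod`, `isDue`, and `isDueStrict` are hypothetical and are not taken from the ValidationTracker source; only the 8-second grace period and the change from `>` to `>=` come from the commit message.

```cpp
#include <chrono>

// Hypothetical constant: the 8-second grace period named in the commit message.
constexpr std::chrono::seconds kGracePeriod{8};

// Inclusive comparison (the behaviour described after the fix):
// an entry that is exactly kGracePeriod old is due for reconciliation.
constexpr bool isDue(std::chrono::seconds age)
{
    return age >= kGracePeriod;
}

// Strict comparison (the behaviour described before the fix):
// an entry that is exactly kGracePeriod old is skipped until a later pass.
constexpr bool isDueStrict(std::chrono::seconds age)
{
    return age > kGracePeriod;
}
```

With these helpers, an entry aged exactly 8 seconds is reconciled by `isDue` but not by `isDueStrict`, which is the off-by-one the commit message describes.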
Pratik Mankawde
8365f7dda3 feat(telemetry): add ValidationTracker for validation agreement tracking (Task 7.8)
Standalone class that tracks whether this validator's validations agree
with network consensus, maintaining rolling 1h/24h windows and lifetime
totals with a late-repair mechanism for out-of-order arrivals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
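
A rolling agreement window of the kind 8365f7dda3 describes might be sketched as below. This is an illustration under stated assumptions, not the ValidationTracker implementation: the class name `RollingAgreement`, its methods, and the choice to report 0% for an empty window are all hypothetical, and the late-repair mechanism and lifetime totals are omitted.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical sketch of a single rolling window (e.g. 1h or 24h).
class RollingAgreement
{
public:
    using Clock = std::chrono::steady_clock;

    explicit RollingAgreement(std::chrono::seconds window) : window_(window)
    {
    }

    // Record one reconciled validation: did we agree with the network?
    void record(Clock::time_point when, bool agreed)
    {
        entries_.push_back({when, agreed});
        if (agreed)
            ++agreements_;
    }

    // Drop entries older than the window, keeping the running count in sync.
    void expire(Clock::time_point now)
    {
        while (!entries_.empty() && now - entries_.front().when > window_)
        {
            if (entries_.front().agreed)
                --agreements_;
            entries_.pop_front();
        }
    }

    std::size_t agreements() const { return agreements_; }
    std::size_t missed() const { return entries_.size() - agreements_; }

    // Assumption: an empty window reports 0.0 rather than 100.0 or NaN.
    double agreementPct() const
    {
        if (entries_.empty())
            return 0.0;
        return 100.0 * static_cast<double>(agreements_) /
            static_cast<double>(entries_.size());
    }

private:
    struct Entry
    {
        Clock::time_point when;
        bool agreed;
    };

    std::chrono::seconds window_;
    std::deque<Entry> entries_;  // ordered oldest to newest
    std::size_t agreements_ = 0;
};
```

A deque fits this access pattern because entries arrive in time order and only ever leave from the front, so both recording and expiry stay cheap per entry.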
Pratik Mankawde
391b8f91ce docs: add Tasks 7.9-7.16 for external dashboard parity metrics
Adds ValidationTracker (agreement computation with 8s grace period),
validator health, peer quality, ledger economy, state tracking,
storage detail gauges, 7 synchronous counters, and agreement gauge.

29 new metrics covering validation agreement, peer quality, UNL health,
ledger economy, state tracking, and upgrade awareness.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
2f7064ace6 Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
1ef234de9d docs(telemetry): replace Jaeger with Tempo in data collection reference
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
a37cf74868 docs: add peerDisconnectsCharges metric to data collection reference
Bridge the existing beast::insight gauge for resource-limit peer
disconnects (peerDisconnectsCharges_) into the StatsD metric inventory.

Part of the external dashboard parity initiative.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
21192e9b3f Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
2a2c9dc5dc fix: remove non-existent CanonicalTXSet.h include from BuildLedger.cpp
The xrpld/app/misc/CanonicalTXSet.h header doesn't exist — it was
incorrectly added during a rebase conflict resolution. The correct
include xrpl/ledger/CanonicalTXSet.h is already present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
6723815563 feat(telemetry): add validation attributes to peer.validation.receive span (Task 4.8)
Add ledger hash and full-validation flag to peer.validation.receive
spans for trace-level agreement analysis across validators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
7e5591318f Phase 5b: Ledger, peer, and tx spans with expanded Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
87ed778efe refactor(telemetry): migrate integration test and docs from Jaeger to Tempo API
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
d0ff82801c fix: use docker/telemetry/data/ for runtime data and add .gitignore
Move xrpld data paths from ./data/ to docker/telemetry/data/ so runtime
files stay within the docker telemetry directory. Add .gitignore to
exclude the data directory from version control.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
f940290866 Phase 5: Documentation, deployment configs, integration test infrastructure
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
014060370a fix(telemetry): move quorum/proposers attributes to consensus.accept span
Move validation_quorum and proposers_validated attributes from
consensus.accept.apply to consensus.accept span to match the design
spec. Both values are available in onAccept() scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
8c222b9e05 feat(telemetry): add consensus validation span enrichment (Task 4.8)
Add validation ledger hash and full-validation flag to
consensus.validation.send spans, plus quorum and proposer count to
consensus.accept spans for trace-level agreement analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
95f0c8bf51 docs: add Task 4.8 consensus validation span enrichment for external dashboard parity
Adds ledger_hash, validation.full to validation send/receive spans,
and validation_quorum, proposers_validated to consensus.accept spans.
Foundation for Phase 7 ValidationTracker agreement computation.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
a127711b86 Phase 4: Consensus tracing - round lifecycle, proposals, validations, close time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
715c531512 feat(telemetry): add peer version attribute to tx.receive spans (Task 3.7)
Tag transaction receive spans with the relaying peer's rippled version
to enable version-mismatch correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:27 +01:00
Pratik Mankawde
e6508a5bbc docs: add Task 3.8 TX span peer version attribute for external dashboard parity
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:27 +01:00
Pratik Mankawde
88d17e4c04 Phase 3: Transaction tracing - protobuf context propagation, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:27 +01:00
Pratik Mankawde
9ab8570153 docs(telemetry): replace Jaeger references with Tempo in Phase 2-5 task lists
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
8f2507a945 feat(telemetry): add node health attributes to RPC spans (Task 2.8)
Add amendment_blocked and server_state span attributes to every
rpc.command.* span so operators can correlate RPC behavior with node state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
befffc573c docs: add Task 2.8 RPC span attribute enrichment for external dashboard parity
Adds node health context (amendment_blocked, server_state) to rpc.command.*
spans, inspired by the community xrpl-validator-dashboard.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
945faac770 Phase 2: RPC tracing - span macros, attributes, WebSocket, command spans
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
c8b1686ce4 Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
ba92ccad14 Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00