Commit Graph

13916 Commits

Author SHA1 Message Date
Pratik Mankawde
15bee7a01a feat(telemetry): add 7-day agreement window to validation_agreement gauge
Add agreement_pct_7d, agreements_7d, missed_7d labels to the
rippled_validation_agreement observable gauge, matching the external
xrpl-validator-dashboard's 7-day tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:40:46 +01:00
Pratik Mankawde
1408ffd794 fix(telemetry): fix ServiceRegistry API names and transaction rate computation
- cachedSLEs() -> getCachedSLEs()
- openLedger() -> getOpenLedger()
- overlay() -> getOverlay()
- Use OpenView::txCount() for transaction rate instead of SHAMap::size()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
5d8d7ad6dc feat(telemetry): wire ValidationTracker to MetricsRegistry and consensus hooks
Add ValidationTracker member to MetricsRegistry with a public accessor,
register a rippled_validation_agreement observable gauge that calls
reconcile() and reports 1h/24h agreement percentages and counts, and
hook recordOurValidation/recordNetworkValidation into RCLConsensus
validate() and LedgerMaster setValidLedger() respectively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
29e1e851a3 feat(telemetry): add validationsChecked recording hook in recvValidation
Wire incrementValidationsChecked() into NetworkOPs::recvValidation() so
each received network validation increments the counter.

Note: incrementJqTransOverflow() hook is deferred — JobQueue has no
explicit overflow event path; the counter is reserved for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
c4bafa3c93 fix(telemetry): add missing counters, fix dashboard metric name, clean dead code
- Add rippled_validation_agreements_total and rippled_validation_missed_total
  counter declarations and creation (wiring to ValidationTracker pending rebase)
- Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"}
  instead of non-existent rippled_Overlay_Peer_Disconnects_Charges
- Remove dead getCountsJson() call in storageDetail callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
1371aebb63 fix(telemetry): fix metric labels and add missing parity gauge values
- Rename fee labels to match spec: base_fee_drops -> base_fee_xrp,
  reserve_base_drops -> reserve_base_xrp, reserve_inc_drops -> reserve_inc_xrp
- Add peers_insane_count (stub with TODO for PeerImp::tracking_ exposure)
- Add transaction_rate to ledger economy gauge
- Replace node_store_writes/node_written_bytes with nudb_bytes per spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
2f48e68013 feat(telemetry): add external dashboard parity gauges and counters to MetricsRegistry
Add validator health, peer quality, ledger economy, state tracking, and
storage detail observable gauges plus 5 synchronous counters with recording
hooks for ledger close, validation send, state change, and overflow events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
dfda2e4185 feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
92d109ce16 docs: add external dashboard parity tasks and metric reference for Phase 9
Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards),
new metric tables in data-collection-reference, and monitoring sections in runbook
covering validation agreement, validator health, peer quality, and state tracking.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
6738f8b9ab docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
5e19b24ebf feat(telemetry): add push_metrics.py parity gauges to MetricsRegistry
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
43d36ff4f0 Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
6916734eae Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00
Pratik Mankawde
aa24a38d51 feat(telemetry): add 7-day validation agreement window to ValidationTracker
Add window7d_ deque, agreementPct7d(), agreements7d(), missed7d() to
match the external xrpl-validator-dashboard's 7-day agreement tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
8880d15c8c test(telemetry): add ValidationTracker unit tests
Cover normal agreement, missed validation, late repair, empty window,
grace period boundary, max pending trimming, mixed results, duplicate
recording, and only-we-validated scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
f329a5d899 fix(telemetry): fix ValidationTracker grace period boundary and hard trim
- Use >= instead of > for grace period comparison to reconcile at exactly
  8 seconds rather than skipping the boundary
- Two-pass hard trim: first remove entries past late-repair window, then
  any reconciled entry, to avoid sabotaging late repairs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
4c11ebbfc0 feat(telemetry): add ValidationTracker for validation agreement tracking (Task 7.8)
Standalone class that tracks whether this validator's validations agree
with network consensus, maintaining rolling 1h/24h windows and lifetime
totals with a late-repair mechanism for out-of-order arrivals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
fe10835a7c docs: add Tasks 7.9-7.16 for external dashboard parity metrics
Adds ValidationTracker (agreement computation with 8s grace period),
validator health, peer quality, ledger economy, state tracking,
storage detail gauges, 7 synchronous counters, and agreement gauge.

29 new metrics covering validation agreement, peer quality, UNL health,
ledger economy, state tracking, and upgrade awareness.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
4137495282 Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
7e808806a9 docs: add peerDisconnectsCharges metric to data collection reference
Bridge the existing beast::insight gauge for resource-limit peer
disconnects (peerDisconnectsCharges_) into the StatsD metric inventory.

Part of the external dashboard parity initiative.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
7ad43d4c21 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
4f8197e01d fix: remove non-existent CanonicalTXSet.h include from BuildLedger.cpp
The xrpld/app/misc/CanonicalTXSet.h header doesn't exist — it was
incorrectly added during a rebase conflict resolution. The correct
include xrpl/ledger/CanonicalTXSet.h is already present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
f35ed05bcc feat(telemetry): add validation attributes to peer.validation.receive span (Task 4.8)
Add ledger hash and full-validation flag to peer.validation.receive
spans for trace-level agreement analysis across validators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
753e7721e0 Phase 5b: Ledger, peer, and tx spans with expanded Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00
Pratik Mankawde
58aa308052 fix: use docker/telemetry/data/ for runtime data and add .gitignore
Move xrpld data paths from ./data/ to docker/telemetry/data/ so runtime
files stay within the docker telemetry directory. Add .gitignore to
exclude the data directory from version control.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:39:20 +01:00
Pratik Mankawde
c07dd573fe Phase 5: Documentation, deployment configs, integration test infrastructure
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:50:38 +01:00
Pratik Mankawde
aa329d5084 fix(telemetry): move quorum/proposers attributes to consensus.accept span
Move validation_quorum and proposers_validated attributes from
consensus.accept.apply to consensus.accept span to match the design
spec. Both values are available in onAccept() scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:48:05 +01:00
Pratik Mankawde
27d2078419 feat(telemetry): add consensus validation span enrichment (Task 4.8)
Add validation ledger hash and full-validation flag to
consensus.validation.send spans, plus quorum and proposer count to
consensus.accept spans for trace-level agreement analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:09:49 +01:00
Pratik Mankawde
14770376c3 docs: add Task 4.8 consensus validation span enrichment for external dashboard parity
Adds ledger_hash, validation.full to validation send/receive spans,
and validation_quorum, proposers_validated to consensus.accept spans.
Foundation for Phase 7 ValidationTracker agreement computation.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:09:49 +01:00
Pratik Mankawde
69d4b77abf Phase 4: Consensus tracing - round lifecycle, proposals, validations, close time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:09:49 +01:00
Pratik Mankawde
c93e16a413 feat(telemetry): add peer version attribute to tx.receive spans (Task 3.7)
Tag transaction receive spans with the relaying peer's rippled version
to enable version-mismatch correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:09:40 +01:00
Pratik Mankawde
8dcb03d73b docs: add Task 3.8 TX span peer version attribute for external dashboard parity
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:09:40 +01:00
Pratik Mankawde
3436f93870 Phase 3: Transaction tracing - protobuf context propagation, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:09:40 +01:00
Pratik Mankawde
177c1b32db feat(telemetry): add node health attributes to RPC spans (Task 2.8)
Add amendment_blocked and server_state span attributes to every
rpc.command.* span so operators can correlate RPC behavior with node state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 12:11:13 +01:00
Pratik Mankawde
439264dc79 docs: add Task 2.8 RPC span attribute enrichment for external dashboard parity
Adds node health context (amendment_blocked, server_state) to rpc.command.*
spans, inspired by the community xrpl-validator-dashboard.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:07:10 +01:00
Pratik Mankawde
ab6946319c Phase 2: RPC tracing - span macros, attributes, WebSocket, command spans
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:07:10 +01:00
Pratik Mankawde
987ce6202e Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:06:09 +01:00
Pratik Mankawde
a00c59eb7e Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:04:31 +01:00
Pratik Mankawde
8707cc7d48 Phase 1c: RPC integration - ServerHandler tracing, telemetry config wiring
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:01:42 +01:00
Pratik Mankawde
405886719c Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:01:24 +01:00
Pratik Mankawde
3ae9084bd5 Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 16:01:24 +01:00
Pratik Mankawde
0f0c188111 Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:58:38 +01:00
Pratik Mankawde
f135842071 docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
a9bc525f22 moved presentation.md file
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
5c9102bd9a Remove effort estimates from implementation phases document
Strip effort/risk columns from task tables and remove the §6.9 Effort
Summary section with its pie chart and resource requirements table.
Renumber §6.10 Quick Wins → §6.9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
c556f3471b Add Phase 4a implementation status to plan docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
2fb6124412 Appendix: add 00-tracing-fundamentals.md and POC_taskList.md to document index
Split document index into Plan Documents and Task Lists sections.
These files were introduced in this branch but missing from the index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
e482b56f58 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
dependabot[bot]
de671863e2 ci: [DEPENDABOT] bump actions/deploy-pages from 4.0.5 to 5.0.0 (#6684)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 14:09:57 +00:00
dependabot[bot]
e0cabb9f8c ci: [DEPENDABOT] bump codecov/codecov-action from 5.5.3 to 6.0.0 (#6685)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 13:57:32 +00:00