Phase 9: Internal Metric Instrumentation Gap Fill — Task List

Status: Future Enhancement

Goal: Instrument rippled to emit the roughly 50 metrics that are already computed for get_counts/server_info/TxQ/PerfLog but currently lack time-series export via the OTel or beast::insight pipelines.

Scope: Hybrid approach — extend beast::insight for metrics near existing registrations, use OTel Metrics SDK ObservableGauge callbacks for new categories (TxQ, PerfLog, CountedObjects).

Branch: pratik/otel-phase9-metric-gap-fill (from pratik/otel-phase8-log-correlation)

Depends on: Phase 7 (native OTel metrics pipeline) and Phase 8 (log-trace correlation)

Document Relevance
  • 06-implementation-phases.md: Phase 9 plan (motivation, architecture, exit criteria, §6.8.2)
  • 09-data-collection-reference.md: current metric inventory and future-metrics section
  • Phase7_taskList.md: prerequisite (OTel Metrics SDK and OTelCollector class)
  • Phase8_taskList.md: prerequisite (log-trace correlation)

Third-Party Consumer Context

These metrics serve multiple external consumer categories identified during research:

  • Exchanges: fee escalation levels, TxQ depth, settlement latency
  • Payment Processors: load factors, io_latency, transaction throughput
  • Analytics Providers: NodeStore I/O, cache hit rates, counted objects
  • Validators/Operators: per-job execution times, PerfLog RPC counters, consensus timing
  • Academic Researchers: consensus performance time-series, fee market dynamics
  • Institutional Custody: server health scores, reserve calculations, node availability

Task 9.1: NodeStore I/O Metrics

Objective: Export node store read/write performance as time-series metrics.

What to do:

  • In src/libxrpl/nodestore/Database.cpp, extend existing beast::insight registrations to add:

    • Gauge: node_reads_total (cumulative read operations)
    • Gauge: node_reads_hit (cache-served reads)
    • Gauge: node_writes (cumulative write operations)
    • Gauge: node_written_bytes (cumulative bytes written)
    • Gauge: node_read_bytes (cumulative bytes read)
    • Gauge: node_reads_duration_us (cumulative read time in microseconds)
    • Gauge: write_load (current write load score)
    • Gauge: read_queue (items in read queue)
  • These values are already computed in Database::getCountsJson() (line ~236). Wire the same counters to beast::insight hooks.
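
As a sketch of the wiring, the pattern is to keep the cumulative std::atomic counters that Database already maintains as the source of truth and copy them into the insight gauges from the collection hook. The names below (NodeStoreStats, sampleInto, GaugeSetter) are hypothetical stand-ins, not the real rippled types:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Hypothetical stand-ins for the real Database members and
// beast::insight::Gauge; only the sampling pattern is real.
struct NodeStoreStats
{
    std::atomic<std::uint64_t> reads{0};
    std::atomic<std::uint64_t> readsHit{0};
    std::atomic<std::uint64_t> writes{0};
    std::atomic<std::uint64_t> writtenBytes{0};
};

using GaugeSetter = std::function<void(std::uint64_t)>;

// Called from the insight collection hook: snapshot each cumulative
// counter with relaxed ordering and push it into the matching gauge.
inline void
sampleInto(
    NodeStoreStats const& stats,
    GaugeSetter const& readsGauge,
    GaugeSetter const& readsHitGauge,
    GaugeSetter const& writesGauge,
    GaugeSetter const& writtenBytesGauge)
{
    readsGauge(stats.reads.load(std::memory_order_relaxed));
    readsHitGauge(stats.readsHit.load(std::memory_order_relaxed));
    writesGauge(stats.writes.load(std::memory_order_relaxed));
    writtenBytesGauge(stats.writtenBytes.load(std::memory_order_relaxed));
}
```

Exporting cumulative values through gauges (rather than counters) matches how the existing beast::insight registrations behave, so downstream rate() queries still work.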

Key modified files:

  • src/libxrpl/nodestore/Database.cpp
  • src/libxrpl/nodestore/Database.h (add insight members)

Derived Prometheus metrics: rippled_nodestore_reads_total, rippled_nodestore_reads_hit, rippled_nodestore_write_load, etc.

Grafana dashboard: Add "NodeStore I/O" panel group to Node Health dashboard.


Task 9.2: Cache Hit Rate Metrics

Objective: Export SHAMap and ledger cache performance as time-series gauges.

What to do:

  • Register OTel ObservableGauge callbacks (via Phase 7's OTelCollector) for:

    • SLE_hit_rate — SLE cache hit rate (range 0.0 to 1.0)
    • ledger_hit_rate — Ledger object cache hit rate
    • AL_hit_rate — AcceptedLedger cache hit rate
    • treenode_cache_size — SHAMap TreeNode cache size (entries)
    • treenode_track_size — Tracked tree nodes
    • fullbelow_size — FullBelow cache size
  • The callback should read from the same sources as the GetCounts.cpp handler (line ~43).

  • Create a centralized MetricsRegistry class that holds all OTel async gauge registrations, polled at 10-second intervals by the PeriodicMetricReader.
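
At its core the registry maps gauge names to read-only sampler callbacks. The stripped-down sketch below uses std::function polling in place of the real OTel ObservableGauge registration (where the SDK, not the registry, invokes the callbacks); the class shape and names are illustrative only:

```cpp
#include <functional>
#include <map>
#include <string>

// Simplified registry: name -> zero-argument sampler. In the real
// implementation each entry would become an OTel ObservableGauge
// callback registered with the Metrics SDK; here we poll directly.
class MetricsRegistry
{
public:
    void
    registerGauge(std::string name, std::function<double()> sampler)
    {
        samplers_[std::move(name)] = std::move(sampler);
    }

    // Invoked by the periodic reader (e.g. every 10 seconds): sample
    // every registered callback and return the current snapshot.
    std::map<std::string, double>
    collect() const
    {
        std::map<std::string, double> out;
        for (auto const& [name, sampler] : samplers_)
            out[name] = sampler();
        return out;
    }

private:
    std::map<std::string, std::function<double()>> samplers_;
};
```

Because callbacks only read shared state, each sampler must either take the relevant lock or read atomics, the same constraint the real ObservableGauge callbacks will have.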

Key modified files:

  • New: src/xrpld/telemetry/MetricsRegistry.h / .cpp
  • src/xrpld/rpc/handlers/GetCounts.cpp (extract shared access methods)
  • src/xrpld/app/main/Application.cpp (register MetricsRegistry at startup)

Derived Prometheus metrics: rippled_cache_SLE_hit_rate, rippled_cache_ledger_hit_rate, rippled_cache_treenode_size, etc.


Task 9.3: Transaction Queue (TxQ) Metrics

Objective: Export TxQ depth, capacity, and fee escalation levels as time-series.

What to do:

  • Register OTel ObservableGauge callbacks for TxQ state (from TxQ.h line ~143):

    • txq_count — Current transactions in queue
    • txq_max_size — Maximum queue capacity
    • txq_in_ledger — Transactions in current open ledger
    • txq_per_ledger — Expected transactions per ledger
    • txq_reference_fee_level — Reference fee level
    • txq_min_processing_fee_level — Minimum fee to get processed
    • txq_med_fee_level — Median fee level in queue
    • txq_open_ledger_fee_level — Open ledger fee escalation level
  • Add to the MetricsRegistry (Task 9.2).

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp (add TxQ callbacks)
  • src/xrpld/app/tx/detail/TxQ.h (expose metrics accessor if needed)

Derived Prometheus metrics: rippled_txq_count, rippled_txq_max_size, rippled_txq_open_ledger_fee_level, etc.

Grafana dashboard: New Fee Market & TxQ dashboard (rippled-fee-market).


Task 9.4: PerfLog Per-RPC Method Metrics

Objective: Export per-RPC-method call counts and latency as OTel metrics.

What to do:

  • Register OTel instruments for PerfLog RPC counters (from PerfLogImp.cpp line ~63):

    • Counter: rpc_method_started_total{method="<name>"} — calls started
    • Counter: rpc_method_finished_total{method="<name>"} — calls completed
    • Counter: rpc_method_errored_total{method="<name>"} — calls errored
    • Histogram: rpc_method_duration_us{method="<name>"} — execution time distribution
  • Use OTel Counter<int64_t> and Histogram<double> instruments with method attribute label.

  • Hook into the existing PerfLog callback mechanism rather than adding new instrumentation points.
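
The bookkeeping behind the labeled instruments can be pictured as a per-method stats map. In the sketch below, RpcMethodStats and RpcMetrics are hypothetical; the real code would instead call the OTel counter's Add() and histogram's Record() with a method attribute:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-method bookkeeping mirroring the four instruments:
// started/finished/errored counters plus a duration distribution.
struct RpcMethodStats
{
    std::uint64_t started = 0;
    std::uint64_t finished = 0;
    std::uint64_t errored = 0;
    std::vector<double> durationsUs;  // feeds the duration histogram
};

class RpcMetrics
{
public:
    void
    onStart(std::string const& method)
    {
        stats_[method].started++;
    }

    void
    onFinish(std::string const& method, double durationUs, bool ok)
    {
        auto& s = stats_[method];
        (ok ? s.finished : s.errored)++;
        s.durationsUs.push_back(durationUs);
    }

    RpcMethodStats const&
    get(std::string const& method)
    {
        return stats_[method];
    }

private:
    std::map<std::string, RpcMethodStats> stats_;
};
```

The method name becomes the attribute label on each instrument, so Prometheus sees one series per RPC method rather than one instrument per method.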

Key modified files:

  • src/xrpld/perflog/detail/PerfLogImp.cpp (add OTel instrument updates alongside existing JSON counters)
  • src/xrpld/telemetry/MetricsRegistry.cpp (register instruments)

Derived Prometheus metrics: rippled_rpc_method_started_total{method="server_info"}, rippled_rpc_method_duration_us_bucket{method="ledger"}, etc.

Grafana dashboard: Add "Per-Method RPC Breakdown" panel group to RPC Performance dashboard.


Task 9.5: PerfLog Per-Job-Type Metrics

Objective: Export per-job-type queue and execution metrics.

What to do:

  • Register OTel instruments for PerfLog job counters:

    • Counter: job_queued_total{job_type="<name>"} — jobs queued
    • Counter: job_started_total{job_type="<name>"} — jobs started
    • Counter: job_finished_total{job_type="<name>"} — jobs completed
    • Histogram: job_queued_duration_us{job_type="<name>"} — time spent waiting in queue
    • Histogram: job_running_duration_us{job_type="<name>"} — execution time distribution
  • Hook into PerfLog's existing job tracking alongside Task 9.4.

Key modified files:

  • src/xrpld/perflog/detail/PerfLogImp.cpp
  • src/xrpld/telemetry/MetricsRegistry.cpp

Derived Prometheus metrics: rippled_job_queued_total{job_type="ledgerData"}, rippled_job_running_duration_us_bucket{job_type="transaction"}, etc.

Grafana dashboard: New Job Queue Analysis dashboard (rippled-job-queue).


Task 9.6: Counted Object Instance Metrics

Objective: Export live instance counts for key internal object types.

What to do:

  • Register OTel ObservableGauge callbacks for CountedObject<T> instance counts:

    • object_count{type="Transaction"} — live Transaction objects
    • object_count{type="Ledger"} — live Ledger objects
    • object_count{type="NodeObject"} — live NodeObject instances
    • object_count{type="STTx"} — serialized transaction objects
    • object_count{type="STLedgerEntry"} — serialized ledger entries
    • object_count{type="InboundLedger"} — ledgers being fetched
    • object_count{type="Pathfinder"} — active pathfinding computations
    • object_count{type="PathRequest"} — active path requests
    • object_count{type="HashRouterEntry"} — hash router entries
  • The CountedObject template already tracks these via atomic counters. The callback just reads the current counts.
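
A self-contained sketch of the mechanism this task relies on: each type gets a static atomic bumped on construction and decremented on destruction, so the gauge callback is a plain load. (The real template in CountedObject.h also registers each type for iteration, which is omitted here.)

```cpp
#include <atomic>
#include <cstdint>

// Minimal version of the CountedObject pattern: each instantiated
// type T gets its own atomic live-instance counter, maintained by
// construction/destruction and read by the gauge callback.
template <class T>
class Counted
{
public:
    Counted()
    {
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    Counted(Counted const&)
    {
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    ~Counted()
    {
        count_.fetch_sub(1, std::memory_order_relaxed);
    }

    static std::int64_t
    liveCount()
    {
        return count_.load(std::memory_order_relaxed);
    }

private:
    inline static std::atomic<std::int64_t> count_{0};
};

// Example instantiations standing in for the real rippled types.
struct Transaction : Counted<Transaction>
{
};
struct Ledger : Counted<Ledger>
{
};
```

Since the counters already exist, the only new cost is the periodic relaxed load per type in the gauge callback.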

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp (add counted object callbacks)
  • include/xrpl/basics/CountedObject.h (may need static accessor for iteration)

Derived Prometheus metrics: rippled_object_count{type="Transaction"}, rippled_object_count{type="NodeObject"}, etc.

Grafana dashboard: Add "Object Instance Counts" panel to Node Health dashboard.


Task 9.7: Fee Escalation & Load Factor Metrics

Objective: Export the full load factor breakdown as time-series.

What to do:

  • Register OTel ObservableGauge callbacks for load factors (from NetworkOPs.cpp line ~2694):

    • load_factor — combined transaction cost multiplier
    • load_factor_server — server + cluster + network contribution
    • load_factor_local — local server load only
    • load_factor_net — network-wide load estimate
    • load_factor_cluster — cluster peer load
    • load_factor_fee_escalation — open ledger fee escalation
    • load_factor_fee_queue — queue entry fee level
  • These overlap with some existing StatsD metrics but provide finer granularity (individual factor breakdown vs. combined value).

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp
  • src/xrpld/app/misc/NetworkOPs.cpp (expose load factor accessors if needed)

Derived Prometheus metrics: rippled_load_factor, rippled_load_factor_fee_escalation, etc.

Grafana dashboard: Add "Load Factor Breakdown" panel to Fee Market & TxQ dashboard.


Task 9.7a: push_metrics.py Parity — Missing Observable Gauges

Objective: Fill the remaining metric gaps between the external push_metrics.py script (in ripplex-ansible) and the internal OTel MetricsRegistry observable gauges. After this task, every push_metrics.py metric that can be collected internally is covered.

What was done:

  • Extended existing cacheHitRateGauge_ callback with AL_size (AcceptedLedger cache size)
  • Extended existing nodeStoreGauge_ callback with 4 new metrics from getCountsJson():
    • node_reads_duration_us (JSON string — uses std::stoll(asString()))
    • read_request_bundle (native JSON int)
    • read_threads_running (native JSON int)
    • read_threads_total (native JSON int)
  • Added new rippled_server_info Int64ObservableGauge with 8 metrics:
    • server_state — operating mode as int (0=DISCONNECTED .. 4=FULL)
    • uptime — seconds since server start
    • peers — total peer count
    • validated_ledger_seq — validated ledger sequence (atomic read)
    • ledger_current_index — current open ledger sequence
    • peer_disconnects_resources — cumulative resource-related disconnects
    • last_close_proposers — from getConsensusInfo()["previous_proposers"]
    • last_close_converge_time_ms — from getConsensusInfo()["previous_mseconds"]
  • Added new rippled_build_info Int64ObservableGauge (info-style, value=1 with version label)
  • Added new rippled_complete_ledgers Int64ObservableGauge parsing comma-separated ranges into {bound, index} pairs
  • Added new rippled_db_metrics Int64ObservableGauge with 4 metrics:
    • db_kb_total, db_kb_ledger, db_kb_transaction (SQLite stat queries)
    • historical_perminute (historical ledger fetch rate)
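
For the complete_ledgers gauge, the parsing step flattens a range string such as "32570-33000,40000-41000" into {bound, index} labelled points. A sketch under that assumption (parseCompleteLedgers and LedgerBound are hypothetical names; a bare sequence number X is treated as the range X-X):

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One gauge point per range bound: {bound="start"|"end", index=N}.
struct LedgerBound
{
    std::string bound;       // "start" or "end"
    std::size_t index;       // which comma-separated range
    std::uint64_t sequence;  // ledger sequence at that bound
};

// Parse "A-B,C-D,..." into labelled bound points.
inline std::vector<LedgerBound>
parseCompleteLedgers(std::string const& ranges)
{
    std::vector<LedgerBound> out;
    std::istringstream in(ranges);
    std::string piece;
    for (std::size_t i = 0; std::getline(in, piece, ','); ++i)
    {
        auto const dash = piece.find('-');
        std::uint64_t const lo = std::stoull(piece.substr(0, dash));
        std::uint64_t const hi = dash == std::string::npos
            ? lo
            : std::stoull(piece.substr(dash + 1));
        out.push_back({"start", i, lo});
        out.push_back({"end", i, hi});
    }
    return out;
}
```

The real callback would additionally need to handle the "empty" response a syncing node can report, which this sketch leaves out.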

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.h (4 new gauge members, updated ASCII diagram)
  • src/xrpld/telemetry/MetricsRegistry.cpp (4 new callback registrations, 2 callback extensions)

Not implementable inside rippled:

  • connection_count_51233/51234 — OS-level port connection counts from external shell script (get_connection.sh)

Derived Prometheus metrics: rippled_server_info{metric="server_state"}, rippled_build_info{version="2.4.0"}, rippled_complete_ledgers{bound="start",index="0"}, rippled_db_metrics{metric="db_kb_total"}, etc.

Grafana dashboard: New panels added to Node Health dashboard (system-node-health.json).


Task 9.8: New Grafana Dashboards

Objective: Create Grafana dashboards for the new metric categories.

What to do:

  • Create 2 new dashboards:

    1. Fee Market & TxQ (rippled-fee-market) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
    2. Job Queue Analysis (rippled-job-queue) — Per-job-type rates, queue wait times, execution times, job queue depth
  • Update 2 existing dashboards:

    1. Node Health (rippled-statsd-node-health) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
    2. RPC Performance (rippled-rpc-perf) — Add per-method RPC breakdown panels

Key modified files:

  • New: docker/telemetry/grafana/dashboards/rippled-fee-market.json
  • New: docker/telemetry/grafana/dashboards/rippled-job-queue.json
  • docker/telemetry/grafana/dashboards/rippled-statsd-node-health.json
  • docker/telemetry/grafana/dashboards/rippled-rpc-perf.json

Task 9.9: Update Documentation

Objective: Update telemetry reference docs with all new metrics.

What to do:

  • Update OpenTelemetryPlan/09-data-collection-reference.md:

    • Add new section for OTel SDK-exported metrics (NodeStore, cache, TxQ, PerfLog, CountedObjects, load factors)
    • Update Grafana dashboard reference table (add 2 new dashboards)
    • Add Prometheus query examples for new metrics
  • Update docs/telemetry-runbook.md:

    • Add alerting rules for new metrics (NodeStore write_load, TxQ capacity, cache hit rate degradation)
    • Add troubleshooting entries for new metric categories

Key modified files:

  • OpenTelemetryPlan/09-data-collection-reference.md
  • docs/telemetry-runbook.md

Task 9.10: Integration Tests

Objective: Verify all new metrics appear in Prometheus after a test workload.

What to do:

  • Extend the existing telemetry integration test:

    • Start rippled with [telemetry] enabled=1 and [insight] server=otel
    • Submit a batch of RPC calls and transactions
    • Query Prometheus for each new metric family
    • Assert non-zero values for: NodeStore reads, cache hit rates, TxQ count, PerfLog RPC counters, object counts, load factors
  • Add unit tests for the MetricsRegistry class:

    • Verify callback registration and deregistration
    • Verify metric values match get_counts JSON output
    • Verify graceful behavior when telemetry is disabled
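
The get_counts parity check can be expressed as a tolerant comparison between the registry snapshot and the reference values parsed from the JSON (a tolerance is needed because counters may advance between the two reads in a live test). A minimal sketch, with both sides reduced to plain name-to-value maps and matchesReference as a hypothetical helper:

```cpp
#include <cmath>
#include <map>
#include <string>

// Compare a collected metrics snapshot against reference values
// extracted from get_counts JSON, within a small tolerance.
inline bool
matchesReference(
    std::map<std::string, double> const& snapshot,
    std::map<std::string, double> const& reference,
    double tolerance)
{
    for (auto const& [name, expected] : reference)
    {
        auto const it = snapshot.find(name);
        if (it == snapshot.end())
            return false;  // metric missing entirely
        if (std::fabs(it->second - expected) > tolerance)
            return false;  // value drifted beyond tolerance
    }
    return true;
}
```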

Key modified files:

  • src/test/telemetry/MetricsRegistry_test.cpp (new)
  • Existing integration test script (extend assertions)

Task 9.11: Validator Health Dashboard (External Dashboard Parity)

Source: External Dashboard Parity — dashboards for Phase 7 metrics inspired by the community xrpl-validator-dashboard.

Upstream: Phase 7 Tasks 7.9-7.16 (metrics must be emitting). Downstream: Phase 10 (dashboard load checks), Phase 11 (alert rules reference these panels).

Objective: Create a Grafana dashboard for validation agreement, amendment/UNL health, and state tracking.

Dashboard: rippled-validator-health.json

Panels (panel type in brackets):

  • [stat] Agreement % (1h): rippled_validation_agreement{metric="agreement_pct_1h"}
  • [stat] Agreement % (24h): rippled_validation_agreement{metric="agreement_pct_24h"}
  • [bargauge] Agreements vs Missed (1h): agreements_1h and missed_1h side by side
  • [bargauge] Agreements vs Missed (24h): agreements_24h and missed_24h side by side
  • [stat] Validation Rate: rate(rippled_validations_sent_total[5m]) * 60
  • [stat] Validations Checked Rate: rate(rippled_validations_checked_total[5m]) * 60
  • [stat] Amendment Blocked: rippled_validator_health{metric="amendment_blocked"}
  • [stat] UNL Expiry (days): rippled_validator_health{metric="unl_expiry_days"}
  • [stat] Validation Quorum: rippled_validator_health{metric="validation_quorum"}
  • [timeseries] State Value Timeline: rippled_state_tracking{metric="state_value"}
  • [stat] Time in Current State: rippled_state_tracking{metric="time_in_current_state_seconds"}
  • [stat] State Changes Rate: rate(rippled_state_changes_total[1h])
  • [stat] Ledgers Closed Rate: rate(rippled_ledgers_closed_total[5m]) * 60

Dashboard conventions: $node template variable for exported_instance filtering, dark theme, matching existing panel sizes and color schemes.

Key new files: docker/telemetry/grafana/dashboards/rippled-validator-health.json

Exit Criteria:

  • All 13 panels render with non-zero data during normal operation
  • $node filter works correctly for multi-node deployments
  • Amendment blocked and UNL expiry panels use color thresholds (red=blocked/expiring)

Task 9.12: Peer Quality Dashboard (External Dashboard Parity)

Source: External Dashboard Parity

Objective: Create a Grafana dashboard for peer health aggregates.

Dashboard: rippled-peer-quality.json

Panels (panel type in brackets):

  • [timeseries] P90 Peer Latency: rippled_peer_quality{metric="peer_latency_p90_ms"}
  • [stat] Insane/Diverged Peers: rippled_peer_quality{metric="peers_insane_count"}
  • [stat] Higher Version Peers %: rippled_peer_quality{metric="peers_higher_version_pct"}
  • [stat] Upgrade Recommended: rippled_peer_quality{metric="upgrade_recommended"}
  • [timeseries] Resource Disconnects: rippled_Overlay_Peer_Disconnects_Charges
  • [bargauge] Inbound vs Outbound: rippled_Peer_Finder_Active_Inbound_Peers, ..._Outbound_Peers

Key new files: docker/telemetry/grafana/dashboards/rippled-peer-quality.json

Exit Criteria:

  • All 6 panels render correctly
  • P90 latency panel shows trend over time
  • Upgrade recommended panel uses color threshold (red=1, green=0)

Task 9.13: Ledger Economy Dashboard Panels (External Dashboard Parity)

Source: External Dashboard Parity

Objective: Add "Ledger Economy" row to the existing system-node-health.json dashboard.

Panels (panel type in brackets):

  • [stat] Base Fee (drops): rippled_ledger_economy{metric="base_fee_xrp"}
  • [stat] Reserve Base (drops): rippled_ledger_economy{metric="reserve_base_xrp"}
  • [stat] Reserve Inc (drops): rippled_ledger_economy{metric="reserve_inc_xrp"}
  • [stat] Ledger Age: rippled_ledger_economy{metric="ledger_age_seconds"}
  • [timeseries] Transaction Rate: rippled_ledger_economy{metric="transaction_rate"}

Key modified files: docker/telemetry/grafana/dashboards/system-node-health.json

Exit Criteria:

  • 5 new panels render correctly in existing dashboard
  • Fee values match server_info RPC output
  • Transaction rate shows smooth trend (not spiky)

Exit Criteria

  • All ~50 new metrics visible in Prometheus via OTLP pipeline
  • MetricsRegistry class registers/deregisters cleanly with OTel SDK
  • Async gauge callbacks execute at 10s intervals without performance impact
  • 2 new Grafana dashboards operational (Fee Market, Job Queue)
  • 2 existing dashboards updated with new panel groups
  • Integration test validates all new metric families are non-zero
  • No performance regression (< 0.5% CPU overhead from new callbacks)
  • Documentation updated with full new metric inventory
  • Validator Health dashboard renders all 13 panels
  • Peer Quality dashboard renders all 6 panels
  • Ledger Economy panels added to system-node-health dashboard