rippled/OpenTelemetryPlan/Phase9_taskList.md
Pratik Mankawde 1efedb2fe0 Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines
- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
  - MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
  - Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
  - Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, Gantt chart, and effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00


Phase 9: Internal Metric Instrumentation Gap Fill — Task List

Status: Future Enhancement

Goal: Instrument rippled to emit ~50 metrics that already exist in get_counts/server_info/TxQ/PerfLog output but currently lack time-series export via the OTel or beast::insight pipelines.

Scope: Hybrid approach — extend beast::insight for metrics near existing registrations, use OTel Metrics SDK ObservableGauge callbacks for new categories (TxQ, PerfLog, CountedObjects).

Branch: pratik/otel-phase9-metric-gap-fill (from pratik/otel-phase8-log-correlation)

Depends on: Phase 7 (native OTel metrics pipeline) and Phase 8 (log-trace correlation)

Document Relevance

| Document | Relevance |
| --- | --- |
| 06-implementation-phases.md | Phase 9 plan: motivation, architecture, exit criteria (§6.8.2) |
| 09-data-collection-reference.md | Current metric inventory + future metrics section |
| Phase7_taskList.md | Prerequisite — OTel Metrics SDK and OTelCollector class |
| Phase8_taskList.md | Prerequisite — log-trace correlation |

Third-Party Consumer Context

These metrics serve multiple external consumer categories identified during research:

| Consumer Category | Key Metrics They Need |
| --- | --- |
| Exchanges | Fee escalation levels, TxQ depth, settlement latency |
| Payment Processors | Load factors, io_latency, transaction throughput |
| Analytics Providers | NodeStore I/O, cache hit rates, counted objects |
| Validators/Operators | Per-job execution times, PerfLog RPC counters, consensus timing |
| Academic Researchers | Consensus performance time-series, fee market dynamics |
| Institutional Custody | Server health scores, reserve calculations, node availability |

Task 9.1: NodeStore I/O Metrics

Objective: Export node store read/write performance as time-series metrics.

What to do:

  • In src/libxrpl/nodestore/Database.cpp, extend existing beast::insight registrations to add:

    • Gauge: node_reads_total (cumulative read operations)
    • Gauge: node_reads_hit (cache-served reads)
    • Gauge: node_writes (cumulative write operations)
    • Gauge: node_written_bytes (cumulative bytes written)
    • Gauge: node_read_bytes (cumulative bytes read)
    • Gauge: node_reads_duration_us (cumulative read time in microseconds)
    • Gauge: write_load (current write load score)
    • Gauge: read_queue (items in read queue)
  • These values are already computed in Database::getCountsJson() (line ~236). Wire the same counters to beast::insight hooks.
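
The wiring pattern can be sketched without the rippled types: the existing read/write code paths bump plain atomic counters, and the periodic insight hook only snapshots them, so no locking is added to the hot path. The struct and member names below are illustrative stand-ins, not the actual `Database` members.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for counters Database already maintains; the real
// members live in src/libxrpl/nodestore/Database.h. The key point: the
// insight hook only reads values that existing code paths increment.
struct NodeStoreCounters
{
    std::atomic<std::uint64_t> reads{0};
    std::atomic<std::uint64_t> readsHit{0};
    std::atomic<std::uint64_t> writes{0};
    std::atomic<std::uint64_t> writtenBytes{0};

    void onRead(bool hit)
    {
        reads.fetch_add(1, std::memory_order_relaxed);
        if (hit)
            readsHit.fetch_add(1, std::memory_order_relaxed);
    }

    void onWrite(std::uint64_t bytes)
    {
        writes.fetch_add(1, std::memory_order_relaxed);
        writtenBytes.fetch_add(bytes, std::memory_order_relaxed);
    }
};

// What the periodic hook would do: a lock-free, read-only snapshot.
struct Snapshot
{
    std::uint64_t reads, readsHit, writes, writtenBytes;
};

Snapshot snapshot(NodeStoreCounters const& c)
{
    return {c.reads.load(), c.readsHit.load(), c.writes.load(),
            c.writtenBytes.load()};
}
```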

Key modified files:

  • src/libxrpl/nodestore/Database.cpp
  • src/libxrpl/nodestore/Database.h (add insight members)

Derived Prometheus metrics: rippled_nodestore_reads_total, rippled_nodestore_reads_hit, rippled_nodestore_write_load, etc.

Grafana dashboard: Add "NodeStore I/O" panel group to Node Health dashboard.


Task 9.2: Cache Hit Rate Metrics

Objective: Export SHAMap and ledger cache performance as time-series gauges.

What to do:

  • Register OTel ObservableGauge callbacks (via Phase 7's OTelCollector) for:

    • SLE_hit_rate — SLE cache hit rate (0.0 to 1.0)
    • ledger_hit_rate — Ledger object cache hit rate
    • AL_hit_rate — AcceptedLedger cache hit rate
    • treenode_cache_size — SHAMap TreeNode cache size (entries)
    • treenode_track_size — Tracked tree nodes
    • fullbelow_size — FullBelow cache size
  • The callback should read from the same sources as the GetCounts.cpp handler (line ~43).

  • Create a centralized MetricsRegistry class that holds all OTel async gauge registrations, polled at 10-second intervals by the PeriodicMetricReader.
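
A minimal sketch of what the proposed MetricsRegistry could look like, reduced to named polling callbacks; in the real class each callback would be handed to the OTel SDK as an ObservableGauge callback, and `poll()` stands in for what the PeriodicMetricReader does every 10 seconds. All names here are assumptions, since the class does not exist yet.

```cpp
#include <functional>
#include <map>
#include <string>

// Sketch only: the real registry would register each callback with the OTel
// Metrics SDK rather than invoking it directly.
class MetricsRegistry
{
public:
    using Callback = std::function<double()>;

    void registerGauge(std::string name, Callback cb)
    {
        gauges_.emplace(std::move(name), std::move(cb));
    }

    void unregisterGauge(std::string const& name)
    {
        gauges_.erase(name);
    }

    // Stand-in for the periodic collection pass.
    std::map<std::string, double> poll() const
    {
        std::map<std::string, double> out;
        for (auto const& [name, cb] : gauges_)
            out[name] = cb();
        return out;
    }

private:
    std::map<std::string, Callback> gauges_;
};
```

Keeping every async gauge behind one registry also gives the integration tests (Task 9.10) a single place to verify registration and deregistration.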

Key modified files:

  • New: src/xrpld/telemetry/MetricsRegistry.h / .cpp
  • src/xrpld/rpc/handlers/GetCounts.cpp (extract shared access methods)
  • src/xrpld/app/main/Application.cpp (register MetricsRegistry at startup)

Derived Prometheus metrics: rippled_cache_SLE_hit_rate, rippled_cache_ledger_hit_rate, rippled_cache_treenode_size, etc.


Task 9.3: Transaction Queue (TxQ) Metrics

Objective: Export TxQ depth, capacity, and fee escalation levels as time-series.

What to do:

  • Register OTel ObservableGauge callbacks for TxQ state (from TxQ.h line ~143):

    • txq_count — Current transactions in queue
    • txq_max_size — Maximum queue capacity
    • txq_in_ledger — Transactions in current open ledger
    • txq_per_ledger — Expected transactions per ledger
    • txq_reference_fee_level — Reference fee level
    • txq_min_processing_fee_level — Minimum fee to get processed
    • txq_med_fee_level — Median fee level in queue
    • txq_open_ledger_fee_level — Open ledger fee escalation level
  • Add to the MetricsRegistry (Task 9.2).
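
To illustrate the relationship these gauges expose: once the open ledger holds more transactions than the expected per-ledger count, the required fee level escalates with the square of the fill ratio. The helper below is a simplified sketch of that quadratic escalation, not TxQ's actual code; the reference level 256 matches XRPL's base fee level, but the signature is an assumption.

```cpp
#include <cstdint>

// Simplified open-ledger fee escalation: below the per-ledger target the
// reference fee level (256) applies; above it, the required level grows
// quadratically with how full the open ledger is. Illustrative only.
std::uint64_t
openLedgerFeeLevel(
    std::uint64_t medFeeLevel,
    std::uint64_t inLedger,
    std::uint64_t perLedger)
{
    if (inLedger <= perLedger)
        return 256;  // reference (base) fee level
    return medFeeLevel * inLedger * inLedger / (perLedger * perLedger);
}
```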

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp (add TxQ callbacks)
  • src/xrpld/app/tx/detail/TxQ.h (expose metrics accessor if needed)

Derived Prometheus metrics: rippled_txq_count, rippled_txq_max_size, rippled_txq_open_ledger_fee_level, etc.

Grafana dashboard: New Fee Market & TxQ dashboard (rippled-fee-market).


Task 9.4: PerfLog Per-RPC Method Metrics

Objective: Export per-RPC-method call counts and latency as OTel metrics.

What to do:

  • Register OTel instruments for PerfLog RPC counters (from PerfLogImp.cpp line ~63):

    • Counter: rpc_method_started_total{method="<name>"} — calls started
    • Counter: rpc_method_finished_total{method="<name>"} — calls completed
    • Counter: rpc_method_errored_total{method="<name>"} — calls errored
    • Histogram: rpc_method_duration_us{method="<name>"} — execution time distribution
  • Use OTel Counter<int64_t> and Histogram<double> instruments with a method attribute label.

  • Hook into the existing PerfLog callback mechanism rather than adding new instrumentation points.
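
The per-method bookkeeping PerfLog already does in JSON form can be sketched as plain counters keyed by method name; in the real change, each update would additionally Add() to the OTel Counter or Record() into the Histogram with a `{method="..."}` attribute. Class and member names below are illustrative.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative per-method stats, mirroring what PerfLog tracks for its
// JSON output. The OTel hook would update instruments at the same points.
struct RpcMethodStats
{
    std::uint64_t started = 0;
    std::uint64_t finished = 0;
    std::uint64_t errored = 0;
    std::uint64_t totalDurationUs = 0;
};

class RpcMetrics
{
public:
    void onStart(std::string const& method)
    {
        stats_[method].started++;
    }

    void onFinish(std::string const& method, std::uint64_t durationUs, bool ok)
    {
        auto& s = stats_[method];
        (ok ? s.finished : s.errored)++;
        s.totalDurationUs += durationUs;
    }

    RpcMethodStats const& get(std::string const& method)
    {
        return stats_[method];
    }

private:
    std::map<std::string, RpcMethodStats> stats_;
};
```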

Key modified files:

  • src/xrpld/perflog/detail/PerfLogImp.cpp (add OTel instrument updates alongside existing JSON counters)
  • src/xrpld/telemetry/MetricsRegistry.cpp (register instruments)

Derived Prometheus metrics: rippled_rpc_method_started_total{method="server_info"}, rippled_rpc_method_duration_us_bucket{method="ledger"}, etc.

Grafana dashboard: Add "Per-Method RPC Breakdown" panel group to RPC Performance dashboard.


Task 9.5: PerfLog Per-Job-Type Metrics

Objective: Export per-job-type queue and execution metrics.

What to do:

  • Register OTel instruments for PerfLog job counters:

    • Counter: job_queued_total{job_type="<name>"} — jobs queued
    • Counter: job_started_total{job_type="<name>"} — jobs started
    • Counter: job_finished_total{job_type="<name>"} — jobs completed
    • Histogram: job_queued_duration_us{job_type="<name>"} — time spent waiting in queue
    • Histogram: job_running_duration_us{job_type="<name>"} — execution time distribution
  • Hook into PerfLog's existing job tracking alongside Task 9.4.

Key modified files:

  • src/xrpld/perflog/detail/PerfLogImp.cpp
  • src/xrpld/telemetry/MetricsRegistry.cpp

Derived Prometheus metrics: rippled_job_queued_total{job_type="ledgerData"}, rippled_job_running_duration_us_bucket{job_type="transaction"}, etc.

Grafana dashboard: New Job Queue Analysis dashboard (rippled-job-queue).


Task 9.6: Counted Object Instance Metrics

Objective: Export live instance counts for key internal object types.

What to do:

  • Register OTel ObservableGauge callbacks for CountedObject<T> instance counts:

    • object_count{type="Transaction"} — live Transaction objects
    • object_count{type="Ledger"} — live Ledger objects
    • object_count{type="NodeObject"} — live NodeObject instances
    • object_count{type="STTx"} — serialized transaction objects
    • object_count{type="STLedgerEntry"} — serialized ledger entries
    • object_count{type="InboundLedger"} — ledgers being fetched
    • object_count{type="Pathfinder"} — active pathfinding computations
    • object_count{type="PathRequest"} — active path requests
    • object_count{type="HashRouterEntry"} — hash router entries
  • The CountedObject template already tracks these via atomic counters. The callback just reads the current counts.
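
The pattern can be shown with a simplified version of CountedObject<T>: a per-type static atomic counter bumped in the constructor and decremented in the destructor, so the gauge callback only has to load the current value. This is a sketch of the idea, not the actual code in include/xrpl/basics/CountedObject.h.

```cpp
#include <atomic>

// Simplified CountedObject-style pattern: construction/destruction keep a
// per-type live-instance count that a gauge callback can read lock-free.
template <class T>
class Counted
{
public:
    Counted()
    {
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    Counted(Counted const&)
    {
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    ~Counted()
    {
        count_.fetch_sub(1, std::memory_order_relaxed);
    }

    static int count()
    {
        return count_.load(std::memory_order_relaxed);
    }

private:
    static inline std::atomic<int> count_{0};
};

// Example instrumented type (hypothetical; real types inherit the rippled
// template instead).
struct Transaction : Counted<Transaction>
{
};
```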

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp (add counted object callbacks)
  • include/xrpl/basics/CountedObject.h (may need static accessor for iteration)

Derived Prometheus metrics: rippled_object_count{type="Transaction"}, rippled_object_count{type="NodeObject"}, etc.

Grafana dashboard: Add "Object Instance Counts" panel to Node Health dashboard.


Task 9.7: Fee Escalation & Load Factor Metrics

Objective: Export the full load factor breakdown as time-series.

What to do:

  • Register OTel ObservableGauge callbacks for load factors (from NetworkOPs.cpp line ~2694):

    • load_factor — combined transaction cost multiplier
    • load_factor_server — server + cluster + network contribution
    • load_factor_local — local server load only
    • load_factor_net — network-wide load estimate
    • load_factor_cluster — cluster peer load
    • load_factor_fee_escalation — open ledger fee escalation
    • load_factor_fee_queue — queue entry fee level
  • These overlap with some existing StatsD metrics but provide finer granularity (individual factor breakdown vs. combined value).
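
For context on how consumers read these gauges: server_info reports load_factor relative to load_base (256 when the server is at rest), and the effective transaction cost scales linearly with their ratio. The helper below only illustrates that scaling; rippled performs it internally, and the function name is an assumption.

```cpp
#include <cstdint>

// Illustrative scaling: effective fee = base fee * load_factor / load_base,
// where load_base is 256 at rest. Not rippled's actual code.
std::uint64_t
scaledFeeDrops(
    std::uint64_t baseFeeDrops,
    std::uint64_t loadFactor,
    std::uint64_t loadBase = 256)
{
    return baseFeeDrops * loadFactor / loadBase;
}
```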

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.cpp
  • src/xrpld/app/misc/NetworkOPs.cpp (expose load factor accessors if needed)

Derived Prometheus metrics: rippled_load_factor, rippled_load_factor_fee_escalation, etc.

Grafana dashboard: Add "Load Factor Breakdown" panel to Fee Market & TxQ dashboard.


Task 9.8: New Grafana Dashboards

Objective: Create Grafana dashboards for the new metric categories.

What to do:

  • Create 2 new dashboards:

    1. Fee Market & TxQ (rippled-fee-market) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
    2. Job Queue Analysis (rippled-job-queue) — Per-job-type rates, queue wait times, execution times, job queue depth
  • Update 2 existing dashboards:

    1. Node Health (rippled-statsd-node-health) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
    2. RPC Performance (rippled-rpc-perf) — Add per-method RPC breakdown panels

Key modified files:

  • New: docker/telemetry/grafana/dashboards/rippled-fee-market.json
  • New: docker/telemetry/grafana/dashboards/rippled-job-queue.json
  • docker/telemetry/grafana/dashboards/rippled-statsd-node-health.json
  • docker/telemetry/grafana/dashboards/rippled-rpc-perf.json

Task 9.9: Update Documentation

Objective: Update telemetry reference docs with all new metrics.

What to do:

  • Update OpenTelemetryPlan/09-data-collection-reference.md:

    • Add new section for OTel SDK-exported metrics (NodeStore, cache, TxQ, PerfLog, CountedObjects, load factors)
    • Update Grafana dashboard reference table (add 2 new dashboards)
    • Add Prometheus query examples for new metrics
  • Update docs/telemetry-runbook.md:

    • Add alerting rules for new metrics (NodeStore write_load, TxQ capacity, cache hit rate degradation)
    • Add troubleshooting entries for new metric categories

Key modified files:

  • OpenTelemetryPlan/09-data-collection-reference.md
  • docs/telemetry-runbook.md

Task 9.10: Integration Tests

Objective: Verify all new metrics appear in Prometheus after a test workload.

What to do:

  • Extend the existing telemetry integration test:

    • Start rippled with [telemetry] enabled=1 and [insight] server=otel
    • Submit a batch of RPC calls and transactions
    • Query Prometheus for each new metric family
    • Assert non-zero values for: NodeStore reads, cache hit rates, TxQ count, PerfLog RPC counters, object counts, load factors
  • Add unit tests for the MetricsRegistry class:

    • Verify callback registration and deregistration
    • Verify metric values match get_counts JSON output
    • Verify graceful behavior when telemetry is disabled

Key modified files:

  • src/test/telemetry/MetricsRegistry_test.cpp (new)
  • Existing integration test script (extend assertions)

Effort Summary

| Task | Description | Effort | Risk |
| --- | --- | --- | --- |
| 9.1 | NodeStore I/O metrics | 1d | Low |
| 9.2 | Cache hit rate metrics + MetricsRegistry | 2d | Medium |
| 9.3 | TxQ metrics | 1d | Low |
| 9.4 | PerfLog per-RPC metrics | 1.5d | Medium |
| 9.5 | PerfLog per-job metrics | 1d | Low |
| 9.6 | Counted object instance metrics | 0.5d | Low |
| 9.7 | Fee escalation & load factor metrics | 0.5d | Low |
| 9.8 | New Grafana dashboards | 2d | Low |
| 9.9 | Update documentation | 1d | Low |
| 9.10 | Integration tests | 1.5d | Medium |

Total Effort: 12 days

Exit Criteria

  • All ~50 new metrics visible in Prometheus via OTLP pipeline
  • MetricsRegistry class registers/deregisters cleanly with OTel SDK
  • Async gauge callbacks execute at 10s intervals without performance impact
  • 2 new Grafana dashboards operational (Fee Market, Job Queue)
  • 2 existing dashboards updated with new panel groups
  • Integration test validates all new metric families are non-zero
  • No performance regression (< 0.5% CPU overhead from new callbacks)
  • Documentation updated with full new metric inventory