rippled/OpenTelemetryPlan/Phase7_taskList.md
Pratik Mankawde fe10835a7c docs: add Tasks 7.9-7.16 for external dashboard parity metrics
Adds ValidationTracker (agreement computation with 8s grace period),
validator health, peer quality, ledger economy, state tracking,
storage detail gauges, 7 synchronous counters, and agreement gauge.

29 new metrics covering validation agreement, peer quality, UNL health,
ledger economy, state tracking, and upgrade awareness.

Part of the external dashboard parity initiative across phases 2-11.
See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:39 +01:00


Phase 7: Native OTel Metrics Migration — Task List

Goal: Replace StatsDCollector with a native OpenTelemetry Metrics SDK implementation behind the existing beast::insight::Collector interface, eliminating the StatsD UDP dependency.

Scope: New OTelCollectorImpl class, CollectorManager config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.

Branch: pratik/otel-phase7-native-metrics (from pratik/otel-phase6-statsd)

Document Relevance

| Document | Relevance |
| --- | --- |
| 06-implementation-phases.md | Phase 7 plan: motivation, architecture, exit criteria (§6.8) |
| 02-design-decisions.md | Collector interface design, beast::insight coexistence strategy |
| 05-configuration-reference.md | [insight] and [telemetry] config sections |
| 09-data-collection-reference.md | Complete metric inventory that must be preserved |

Task 7.1: Add OTel Metrics SDK to Build Dependencies

Objective: Enable the OTel C++ Metrics SDK components in the build system.

What to do:

  • Edit conanfile.py:

    • Add OTel metrics SDK components to the dependency list when telemetry=True
    • Components needed: opentelemetry-cpp::metrics, opentelemetry-cpp::otlp_http_metric_exporter
  • Edit CMakeLists.txt (telemetry section):

    • Link opentelemetry::metrics and opentelemetry::otlp_http_metric_exporter targets

Key modified files:

  • conanfile.py
  • CMakeLists.txt (or the relevant telemetry cmake target)

Reference: 05-configuration-reference.md §5.3 — CMake integration


Task 7.2: Implement OTelCollector Class

Objective: Create the core OTelCollector implementation that maps beast::insight instruments to OTel Metrics SDK instruments.

What to do:

  • Create include/xrpl/beast/insight/OTelCollector.h:

    • Public factory: static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)
    • Derives from StatsDCollector (or directly from Collector — TBD based on shared code)
  • Create src/libxrpl/beast/insight/OTelCollector.cpp (~400-500 lines):

    • OTelCounterImpl: Wraps opentelemetry::metrics::Counter<uint64_t> (the OTel C++ SDK creates synchronous counters via CreateUInt64Counter or CreateDoubleCounter). increment(amount) calls counter->Add(amount).
    • OTelGaugeImpl: Uses an observable gauge created via CreateInt64ObservableGauge (the OTel C++ SDK offers int64/double observables, not uint64) with an async callback. set(value) stores the value atomically; the callback reads it during collection.
    • OTelMeterImpl: Wraps opentelemetry::metrics::Counter<uint64_t>. increment(amount) calls counter->Add(amount). Maps to the same OTel instrument as Counter; beast Meters are unsigned.
    • OTelEventImpl: Wraps opentelemetry::metrics::Histogram<double>. notify(duration) calls histogram->Record(duration.count()). Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
    • OTelHookImpl: Stores handler function. Called during periodic metric collection (same 1s pattern via PeriodicMetricReader).
    • OTelCollectorImp: Main class.
      • Creates MeterProvider with PeriodicMetricReader (1s export interval)
      • Creates OtlpHttpMetricExporter pointing to [telemetry] endpoint
      • Sets resource attributes (service.name, service.instance.id) matching trace exporter
      • Implements all make_*() factory methods
      • Prefixes metric names with [insight] prefix= value
  • Guard all OTel SDK includes with #ifdef XRPL_ENABLE_TELEMETRY so the collector compiles to NullCollector equivalents when telemetry is disabled.
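The gauge is the only asynchronous mapping above. A minimal sketch of the store-then-observe pattern, with a plain std::function standing in for the SDK's observable-gauge callback (class and method names here are illustrative, not the actual OTelCollector internals):

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Sketch: beast::insight set() writes the latest value; the SDK's periodic
// reader later invokes the registered callback, which reads it.
class GaugeCell
{
public:
    void set(std::int64_t v) { value_.store(v, std::memory_order_relaxed); }

    // What OTelGaugeImpl would register with the observable gauge: a reader
    // that observes the most recently stored value at collection time.
    std::function<std::int64_t()> makeObserver() const
    {
        return [this] { return value_.load(std::memory_order_relaxed); };
    }

private:
    std::atomic<std::int64_t> value_{0};
};
```

The real implementation registers the reader via the SDK's AddCallback mechanism; the key point is that set() never touches the SDK directly, so it stays cheap on hot paths.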

Key new files:

  • include/xrpl/beast/insight/OTelCollector.h
  • src/libxrpl/beast/insight/OTelCollector.cpp

Key patterns to follow:

  • Match StatsDCollector.cpp structure: private impl classes, intrusive list for metrics, strand-based thread safety
  • Match existing telemetry code style from src/libxrpl/telemetry/Telemetry.cpp
  • Use RAII for MeterProvider lifecycle (shutdown on destructor)

Reference: 04-code-samples.md — code style and patterns


Task 7.3: Update CollectorManager

Objective: Add server=otel config option to route metric creation to the new OTel backend.

What to do:

  • Edit src/xrpld/app/main/CollectorManager.cpp:

    • In the constructor, add a third branch after server == "statsd":
      else if (server == "otel")
      {
          // Read endpoint from [telemetry] section
          auto const endpoint = get(telemetryParams, "endpoint",
              "http://localhost:4318/v1/metrics");
          std::string const& prefix(get(params, "prefix"));
          m_collector = beast::insight::OTelCollector::New(
              endpoint, prefix, journal);
      }
      
    • This requires access to the [telemetry] config section — may need to pass it as a parameter or read from Application config.
  • Edit src/xrpld/app/main/CollectorManager.h:

    • Add #include <xrpl/beast/insight/OTelCollector.h>

Key modified files:

  • src/xrpld/app/main/CollectorManager.cpp
  • src/xrpld/app/main/CollectorManager.h

Task 7.4: Update OTel Collector Configuration

Objective: Add a metrics pipeline to the OTLP receiver and remove the StatsD receiver dependency.

What to do:

  • Edit docker/telemetry/otel-collector-config.yaml:

    • Remove statsd receiver (no longer needed when server=otel)
    • Add metrics pipeline under service.pipelines:
      metrics:
        receivers: [otlp, spanmetrics]
        processors: [batch]
        exporters: [prometheus]
      
    • The OTLP receiver already listens on :4318 — it just needs to be added to the metrics pipeline receivers.
    • Keep spanmetrics connector in the metrics pipeline so span-derived RED metrics continue working.
  • Edit docker/telemetry/docker-compose.yml:

    • Remove UDP :8125 port mapping from otel-collector service
    • Update rippled service config: change [insight] server=statsd to server=otel

Key modified files:

  • docker/telemetry/otel-collector-config.yaml
  • docker/telemetry/docker-compose.yml

Note: Keep a commented-out statsd receiver block for operators who need backward compatibility.


Task 7.5: Preserve Metric Names in Prometheus

Objective: Ensure existing Grafana dashboards continue working with identical metric names.

What to do:

  • In OTelCollector.cpp, construct OTel instrument names to match existing Prometheus metric names:

    • beast::insight make_gauge("LedgerMaster", "Validated_Ledger_Age") → OTel instrument name: rippled_LedgerMaster_Validated_Ledger_Age
    • The prefix + group + name concatenation must produce the same string as StatsDCollector's format
    • Use underscores as separators (matching StatsD convention)
  • Verify in integration test that key Prometheus queries still return data:

    • rippled_LedgerMaster_Validated_Ledger_Age
    • rippled_Peer_Finder_Active_Inbound_Peers
    • rippled_rpc_requests

Key consideration: The Prometheus exporter in the OTel Collector may normalize metric names differently than the StatsD receiver did — for example, appending a _total suffix to counters and converting dots to underscores. Test this early (alongside Task 7.2) and adjust the naming strategy if needed so the exported names match existing conventions.
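To make the concatenation rule concrete, here is a sketch of the intended naming function (a hypothetical helper, not an existing rippled API), reproducing StatsDCollector's "prefix.group.name" keys and then applying the underscore normalization the dashboards expect:

```cpp
#include <algorithm>
#include <string>

// Sketch: build "prefix.group.name", then normalize dots and spaces to
// underscores so the Prometheus-visible name matches existing dashboards.
std::string
prometheusName(std::string prefix, std::string group, std::string name)
{
    std::string full = std::move(prefix);
    if (!group.empty())
        full += "." + group;
    full += "." + std::move(name);
    std::replace_if(
        full.begin(), full.end(),
        [](char c) { return c == '.' || c == ' '; }, '_');
    return full;
}
```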


Task 7.6: Update Grafana Dashboards

Objective: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.

What to do:

  • If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed.
  • If OTLP export produces different names (e.g., _total suffix on counters), update:
    • docker/telemetry/grafana/dashboards/statsd-node-health.json
    • docker/telemetry/grafana/dashboards/statsd-network-traffic.json
    • docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
  • Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).

Key modified files:

  • docker/telemetry/grafana/dashboards/statsd-*.json (3 files, conditionally)

Task 7.7: Update Integration Tests

Objective: Verify the full OTLP metrics pipeline end-to-end.

What to do:

  • Edit docker/telemetry/integration-test.sh:
    • Update test config to use [insight] server=otel
    • Verify metrics arrive in Prometheus via OTLP (not StatsD)
    • Add check that StatsD receiver is no longer required
    • Preserve all existing metric presence checks

Key modified files:

  • docker/telemetry/integration-test.sh

Task 7.8: Update Documentation

Objective: Update all plan docs, runbook, and reference docs to reflect the migration.

What to do:

  • Edit docs/telemetry-runbook.md:

    • Update [insight] config examples to show server=otel
    • Update troubleshooting section (no more StatsD UDP debugging)
  • Edit OpenTelemetryPlan/09-data-collection-reference.md:

    • Update Data Flow Overview diagram (remove StatsD receiver)
    • Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
    • Update config examples
  • Edit OpenTelemetryPlan/05-configuration-reference.md:

    • Add server=otel option to [insight] section docs
  • Edit docker/telemetry/TESTING.md:

    • Update setup instructions to use server=otel

Key modified files:

  • docs/telemetry-runbook.md
  • OpenTelemetryPlan/09-data-collection-reference.md
  • OpenTelemetryPlan/05-configuration-reference.md
  • docker/telemetry/TESTING.md

Task 7.9: ValidationTracker — Validation Agreement Computation

Source: External Dashboard Parity — the most valuable metric from the community xrpl-validator-dashboard.

Upstream: Phase 4 Task 4.8 (validation span attributes provide ledger hash context). Downstream: Phase 9 (Validator Health dashboard), Phase 10 (validation checks), Phase 11 (agreement alert rules).

Objective: Implement a stateful class that tracks whether our validator's validations agree with network consensus, maintaining rolling 1h and 24h windows with an 8-second grace period and a 5-minute late-repair window.

Architecture:

consensus.validation.send ────> ValidationTracker ────> MetricsRegistry
(records our validation         (reconciles after       (exports agreement
 for ledger X)                   8s grace period)        gauges every 10s)

ledger.validate ──────────────> ValidationTracker
(records which ledger           (marks ledger X as
 network validated)              agreed or missed)

What to do:

  • Create src/xrpld/telemetry/ValidationTracker.h:

    • recordOurValidation(ledgerHash, ledgerSeq) — called when we send a validation
    • recordNetworkValidation(ledgerHash, seq) — called when a ledger is fully validated
    • reconcile() — called periodically; reconciles pending ledger events after 8s grace period
    • Getters: agreementPct1h(), agreementPct24h(), agreements1h(), missed1h(), agreements24h(), missed24h(), totalAgreements(), totalMissed(), totalValidationsSent(), totalValidationsChecked()
    • Thread-safety: atomics for counters, mutex for window deques
  • Create src/xrpld/telemetry/detail/ValidationTracker.cpp:

    • Reconciliation logic: after 8s grace period, check if weValidated && networkValidated && sameHash → agreement; else missed
    • Late repair: if a late validation arrives within 5 minutes, correct a false-positive miss
    • Sliding window: std::deque<WindowEvent> evicts entries older than 1h/24h on each reconciliation pass
    • Ring buffer of 1000 LedgerEvent structs for pending reconciliation
  • Add recording hooks (modifying Phase 4 code from Phase 7 branch):

    • RCLConsensus.cpp validate(): call tracker.recordOurValidation()
    • LedgerMaster.cpp fully-validated path: call tracker.recordNetworkValidation()

Key data structures:

struct LedgerEvent {
    uint256 ledgerHash;
    LedgerIndex seq;
    TimePoint closeTime;
    bool weValidated = false;
    bool networkValidated = false;
    bool reconciled = false;
    bool agreed = false;
};

struct WindowEvent {
    TimePoint time;
    bool agreed;
};
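The two timing rules can be sketched as pure logic over these events (simplified: uint256 reduced to a string hash, the network's hash tracked in a separate field, and elapsed-since-close passed in as an argument — field and function names here are illustrative):

```cpp
#include <chrono>
#include <string>

struct Event
{
    std::string ourHash;         // hash we validated (stands in for uint256)
    std::string networkHash;     // hash the network validated, if any
    bool weValidated = false;
    bool networkValidated = false;
    bool reconciled = false;
    bool agreed = false;
};

// 8s grace period: only reconcile once the grace period has elapsed;
// agreement requires both sides validated with the same hash.
bool
reconcileOne(Event& ev, std::chrono::seconds sinceClose)
{
    if (ev.reconciled || sinceClose < std::chrono::seconds{8})
        return false;  // still inside the grace period
    ev.agreed = ev.weValidated && ev.networkValidated &&
        ev.ourHash == ev.networkHash;
    ev.reconciled = true;
    return true;
}

// 5-minute late repair: a matching network validation arriving late flips a
// previously recorded miss back to an agreement.
bool
lateRepair(Event& ev, std::chrono::seconds sinceClose)
{
    if (!ev.reconciled || ev.agreed || sinceClose > std::chrono::minutes{5})
        return false;
    if (ev.weValidated && ev.networkValidated && ev.ourHash == ev.networkHash)
    {
        ev.agreed = true;
        return true;
    }
    return false;
}
```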

Key new files:

  • src/xrpld/telemetry/ValidationTracker.h
  • src/xrpld/telemetry/detail/ValidationTracker.cpp

Key modified files:

  • src/xrpld/telemetry/MetricsRegistry.h (add ValidationTracker member)
  • src/xrpld/telemetry/MetricsRegistry.cpp (add gauge callback reading from tracker)
  • src/xrpld/app/consensus/RCLConsensus.cpp (add recording hooks)
  • src/xrpld/app/ledger/detail/LedgerMaster.cpp (add recording hook)

Exit Criteria:

  • ValidationTracker correctly tracks agreement with 8s grace period
  • 5-minute late repair corrects false-positive misses
  • Thread-safe (atomics + mutex for window deques)
  • Rolling windows correctly evict stale entries
  • Unit tests: normal agreement, missed validation, late repair, window eviction

Task 7.10: Validator Health Observable Gauges

Source: External Dashboard Parity

Objective: Export amendment blocked, UNL health, and quorum data as a native OTel observable gauge.

What to do:

  • In MetricsRegistry.cpp registerAsyncGauges(), add:
validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
    "rippled_validator_health", "Validator health indicators");

Gauge label values:

| Label (metric=) | Type | Source |
| --- | --- | --- |
| amendment_blocked | int64 | app_.getOPs().isAmendmentBlocked() → 0/1 |
| unl_blocked | int64 | app_.getOPs().isUNLBlocked() → 0/1 |
| unl_expiry_days | double | app_.validators().expires() → days until expiry |
| validation_quorum | int64 | app_.validators().quorum() |
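A minimal sketch of the multi-label observation pattern used for this gauge (an emit function stands in for the SDK's observer, and a plain snapshot struct replaces the app_ accessors — all names here are illustrative):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Snapshot of the four health indicators read inside the callback.
struct ValidatorHealth
{
    bool amendmentBlocked = false;
    bool unlBlocked = false;
    double unlExpiryDays = 0.0;   // negative once expired
    std::int64_t quorum = 0;
};

// One observable gauge, one observation per metric= label value -- the
// shape the registerAsyncGauges() callback would follow.
void
observeValidatorHealth(
    ValidatorHealth const& h,
    std::function<void(std::string const&, double)> const& emit)
{
    emit("amendment_blocked", h.amendmentBlocked ? 1 : 0);
    emit("unl_blocked", h.unlBlocked ? 1 : 0);
    emit("unl_expiry_days", h.unlExpiryDays);
    emit("validation_quorum", static_cast<double>(h.quorum));
}
```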

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp

Exit Criteria:

  • All 4 label values emitted every 10s
  • unl_expiry_days is negative when expired, positive when active
  • Values visible in Prometheus

Task 7.11: Peer Quality Observable Gauges

Source: External Dashboard Parity

Objective: Export peer health aggregates (latency P90, insane peers, version awareness) as a native OTel observable gauge.

What to do:

  • In MetricsRegistry.cpp registerAsyncGauges(), add a callback that iterates app_.overlay().foreach(...) to:
    • Collect per-peer latency values, sort, compute P90
    • Count peers with tracking_ == diverged (insane)
    • Compare peer getVersion() to own version for upgrade awareness

Gauge label values:

| Label (metric=) | Type | Source |
| --- | --- | --- |
| peer_latency_p90_ms | double | P90 from sorted peer latencies |
| peers_insane_count | int64 | Peers with diverged tracking status |
| peers_higher_version_pct | double | % of peers on newer rippled version |
| upgrade_recommended | int64 | 1 if peers_higher_version_pct > 60% |

Implementation note: The callback runs every 10s on the metrics reader thread. Iterating ~50-200 peers is acceptable overhead.
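The P90 step above can be sketched as a nearest-rank percentile over the collected latencies (a hypothetical helper; the real callback would populate the vector from overlay().foreach()):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: nearest-rank P90 on a sorted copy of per-peer latencies.
// Returns 0 when there are no peers to sample.
double
latencyP90(std::vector<double> latencies)
{
    if (latencies.empty())
        return 0.0;
    std::sort(latencies.begin(), latencies.end());
    // Nearest-rank: ceil(0.90 * N), then convert to a zero-based index.
    std::size_t const rank = (latencies.size() * 9 + 9) / 10;
    return latencies[std::min(rank - 1, latencies.size() - 1)];
}
```

Taking the vector by value keeps the caller's peer snapshot unmodified; at ~200 peers the copy and sort are negligible on a 10s cadence.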

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp

Exit Criteria:

  • P90 latency computed correctly
  • Insane count matches peers RPC output
  • Version comparison handles format variations (e.g., "rippled-2.4.0-rc1")

Task 7.12: Ledger Economy Observable Gauges

Source: External Dashboard Parity

Objective: Export fee, reserve, ledger age, and transaction rate as a native OTel observable gauge.

Gauge label values:

| Label (metric=) | Type | Source |
| --- | --- | --- |
| base_fee_xrp | double | Base fee from validated ledger fee settings (drops, converted to XRP) |
| reserve_base_xrp | double | Account reserve from validated ledger (drops, converted to XRP) |
| reserve_inc_xrp | double | Owner reserve increment (drops, converted to XRP) |
| ledger_age_seconds | double | now - lastValidatedCloseTime |
| transaction_rate | double | Derived: tx count delta / time delta (smoothed) |
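The smoothed transaction_rate can be sketched as an exponentially weighted moving average over per-interval deltas (the alpha value and class shape are illustrative assumptions, not from the spec):

```cpp
#include <cstdint>

// Sketch: EWMA smoothing of transactions/sec from cumulative tx counts
// sampled on each collection pass. alpha is an assumed smoothing factor.
class TxRateSmoother
{
public:
    // Feed the cumulative tx count and seconds elapsed since the last
    // sample; returns the smoothed rate.
    double update(std::uint64_t txCount, double elapsedSec)
    {
        if (elapsedSec <= 0.0)
            return rate_;
        double const instant =
            static_cast<double>(txCount - lastCount_) / elapsedSec;
        lastCount_ = txCount;
        rate_ = hasSample_ ? alpha_ * instant + (1.0 - alpha_) * rate_
                           : instant;
        hasSample_ = true;
        return rate_;
    }

private:
    static constexpr double alpha_ = 0.3;  // assumed smoothing factor
    std::uint64_t lastCount_ = 0;
    double rate_ = 0.0;
    bool hasSample_ = false;
};
```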

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp

Exit Criteria:

  • Fee values match server_info RPC output
  • ledger_age_seconds increases monotonically between ledger closes
  • transaction_rate is smoothed (rolling average)

Task 7.13: State Tracking Observable Gauges

Source: External Dashboard Parity

Objective: Export extended state value (0-6 encoding combining OperatingMode + ConsensusMode) and time-in-current-state.

Gauge label values:

| Label (metric=) | Type | Source |
| --- | --- | --- |
| state_value | int64 | 0-6 encoding (see spec for mapping) |
| time_in_current_state_seconds | double | now - lastModeChangeTime from StateAccounting |

State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (full + validating), 6=proposing (full + proposing).
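Under that encoding, the combination can be sketched as follows (the enum is an illustrative stand-in for rippled's OperatingMode; the authoritative mapping lives in the spec):

```cpp
#include <cstdint>

// Illustrative stand-in for rippled's operating mode.
enum class OperatingMode { disconnected, connected, syncing, tracking, full };

// Sketch of the 0-6 encoding: values 0-4 follow OperatingMode directly;
// a FULL server that is validating or proposing is promoted to 5 or 6.
std::int64_t
stateValue(OperatingMode mode, bool validating, bool proposing)
{
    if (mode == OperatingMode::full)
    {
        if (proposing)
            return 6;
        if (validating)
            return 5;
        return 4;
    }
    return static_cast<std::int64_t>(mode);  // 0..3
}
```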

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp

Exit Criteria:

  • state_value correctly combines OperatingMode and ConsensusMode
  • time_in_current_state_seconds resets on mode change

Task 7.14: Storage Detail and Sync Info Gauges

Source: External Dashboard Parity

Objective: Export NuDB-specific storage size and initial sync duration.

Gauge label values:

| Gauge Name | Label (metric=) | Type | Source |
| --- | --- | --- | --- |
| rippled_storage_detail | nudb_bytes | int64 | NuDB backend file size |
| rippled_sync_info | initial_sync_duration_seconds | double | Time from start to first FULL |

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp

Exit Criteria:

  • NuDB file size reported in bytes (0 if NuDB not configured)
  • Sync duration captured once and remains stable after reaching FULL

Task 7.15: New Synchronous Counters

Source: External Dashboard Parity

Objective: Add 7 new event counters incremented at their respective instrumentation sites.

| Counter Name | Increment Site | Source File |
| --- | --- | --- |
| rippled_ledgers_closed_total | onAccept() in consensus | RCLConsensus.cpp |
| rippled_validations_sent_total | validate() in consensus | RCLConsensus.cpp |
| rippled_validations_checked_total | Network validation received | LedgerMaster.cpp |
| rippled_validation_agreements_total | ValidationTracker reconciliation | ValidationTracker.cpp |
| rippled_validation_missed_total | ValidationTracker reconciliation | ValidationTracker.cpp |
| rippled_state_changes_total | setMode() in NetworkOPs | NetworkOPs.cpp |
| rippled_jq_trans_overflow_total | Job queue overflow path | JobQueue.cpp |

Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp (declarations), plus recording sites in RCLConsensus.cpp, LedgerMaster.cpp, NetworkOPs.cpp, JobQueue.cpp

Exit Criteria:

  • All 7 counters monotonically increase during normal operation
  • Counter values match expected rates (e.g., ledgers_closed ≈ 1 per 3-5s)

Task 7.16: Validation Agreement Observable Gauge

Source: External Dashboard Parity

Objective: Export rolling window agreement stats from ValidationTracker (Task 7.9).

Gauge label values:

| Gauge Name | Label (metric=) | Type | Source |
| --- | --- | --- | --- |
| rippled_validation_agreement | agreement_pct_1h | double | tracker.agreementPct1h() |
| | agreements_1h | int64 | tracker.agreements1h() |
| | missed_1h | int64 | tracker.missed1h() |
| | agreement_pct_24h | double | tracker.agreementPct24h() |
| | agreements_24h | int64 | tracker.agreements24h() |
| | missed_24h | int64 | tracker.missed24h() |

Key modified files: src/xrpld/telemetry/MetricsRegistry.cpp

Exit Criteria:

  • Agreement percentages in range [0.0, 100.0]
  • Window stats stabilize after 1h/24h of operation

Summary Table

| Task | Description | New Files | Modified Files | Depends On |
| --- | --- | --- | --- | --- |
| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | — |
| 7.2 | Implement OTelCollector class | 2 | 0 | 7.1 |
| 7.3 | Update CollectorManager config routing | 0 | 2 | 7.2 |
| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 7.3 |
| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 7.2 |
| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 7.5 |
| 7.7 | Update integration tests | 0 | 1 | 7.4 |
| 7.8 | Update documentation | 0 | 4 | 7.6 |
| 7.9 | ValidationTracker (agreement tracking) | 2 | 4 | 7.2, P4.8 |
| 7.10 | Validator health observable gauges | 0 | 2 | 7.2 |
| 7.11 | Peer quality observable gauges | 0 | 2 | 7.2 |
| 7.12 | Ledger economy observable gauges | 0 | 2 | 7.2 |
| 7.13 | State tracking observable gauges | 0 | 2 | 7.2 |
| 7.14 | Storage detail and sync info gauges | 0 | 2 | 7.2 |
| 7.15 | New synchronous counters | 0 | 6 | 7.2 |
| 7.16 | Validation agreement observable gauge | 0 | 1 | 7.9 |

Parallel work: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6. Tasks 7.10-7.14 can all run in parallel after 7.2. Task 7.15 depends on 7.2. Task 7.16 depends on 7.9. Task 7.9 depends on 7.2 and Phase 4 Task 4.8.

Exit Criteria (from 06-implementation-phases.md §6.8):

  • All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
  • server=otel is the default in development docker-compose
  • server=statsd still works as a fallback
  • Existing Grafana dashboards display data correctly
  • Integration test passes with OTLP-only metrics pipeline
  • No performance regression vs StatsD baseline (< 1% CPU overhead)
  • Deferred Task 6.1 (|m wire format) no longer relevant — Meter mapped to OTel Counter
  • ValidationTracker agreement % stabilizes after 1h under normal consensus
  • All new gauges and counters visible in Prometheus with non-zero values