Phase 7: Native OTel Metrics Migration — Task List
Goal: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency.
Scope: New `OTelCollectorImpl` class, `CollectorManager` config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.
Branch: `pratik/otel-phase7-native-metrics` (from `pratik/otel-phase6-statsd`)
Related Plan Documents
| Document | Relevance |
|---|---|
| 06-implementation-phases.md | Phase 7 plan: motivation, architecture, exit criteria (§6.8) |
| 02-design-decisions.md | Collector interface design, beast::insight coexistence strategy |
| 05-configuration-reference.md | [insight] and [telemetry] config sections |
| 09-data-collection-reference.md | Complete metric inventory that must be preserved |
Task 7.1: Add OTel Metrics SDK to Build Dependencies
Objective: Enable the OTel C++ Metrics SDK components in the build system.
What to do:
- Edit `conanfile.py`:
  - Add OTel metrics SDK components to the dependency list when `telemetry=True`
  - Components needed: `opentelemetry-cpp::metrics`, `opentelemetry-cpp::otlp_http_metric_exporter`
- Edit `CMakeLists.txt` (telemetry section):
  - Link the `opentelemetry::metrics` and `opentelemetry::otlp_http_metric_exporter` targets

Key modified files:
- `conanfile.py`
- `CMakeLists.txt` (or the relevant telemetry cmake target)
Reference: 05-configuration-reference.md §5.3 — CMake integration
Task 7.2: Implement OTelCollector Class
Objective: Create the core OTelCollector implementation that maps beast::insight instruments to OTel Metrics SDK instruments.
What to do:
- Create `include/xrpl/beast/insight/OTelCollector.h`:
  - Public factory: `static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)`
  - Derives from `StatsDCollector` (or directly from `Collector` — TBD based on shared code)
- Create `src/libxrpl/beast/insight/OTelCollector.cpp` (~400-500 lines):
  - `OTelCounterImpl`: Wraps `opentelemetry::metrics::Counter<int64_t>`. `increment(amount)` calls `counter->Add(amount)`.
  - `OTelGaugeImpl`: Uses `opentelemetry::metrics::ObservableGauge<uint64_t>` with an async callback. `set(value)` stores the value atomically; the callback reads it during collection.
  - `OTelMeterImpl`: Wraps `opentelemetry::metrics::Counter<uint64_t>`. `increment(amount)` calls `counter->Add(amount)`. Semantically identical to Counter but unsigned.
  - `OTelEventImpl`: Wraps `opentelemetry::metrics::Histogram<double>`. `notify(duration)` calls `histogram->Record(duration.count())`. Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
  - `OTelHookImpl`: Stores the handler function, which is called during periodic metric collection (same 1s pattern via PeriodicMetricReader).
  - `OTelCollectorImp`: Main class.
    - Creates the `MeterProvider` with a `PeriodicMetricReader` (1s export interval)
    - Creates the `OtlpHttpMetricExporter` pointing to the `[telemetry]` endpoint
    - Sets resource attributes (service.name, service.instance.id) matching the trace exporter
    - Implements all `make_*()` factory methods
    - Prefixes metric names with `[insight] prefix=value`
- Guard all OTel SDK includes with `#ifdef XRPL_ENABLE_TELEMETRY` so the build falls back to `NullCollector` equivalents when telemetry is disabled.

Key new files:
- `include/xrpl/beast/insight/OTelCollector.h`
- `src/libxrpl/beast/insight/OTelCollector.cpp`

Key patterns to follow:
- Match the `StatsDCollector.cpp` structure: private impl classes, intrusive list for metrics, strand-based thread safety
- Match the existing telemetry code style from `src/libxrpl/telemetry/Telemetry.cpp`
- Use RAII for the MeterProvider lifecycle (shutdown in the destructor)
Reference: 04-code-samples.md — code style and patterns
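The gauge mapping above (cheap atomic `set()`, value read by a callback at collection time) can be sketched without the OTel SDK. The `GaugeRegistry` below is an illustrative stand-in for the SDK's ObservableGauge registration and the PeriodicMetricReader's export pass, not the real API:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Sketch of the OTelGaugeImpl pattern: set() is lock-free and cheap on
// the hot path; the reader thread observes the latest value at each
// collection cycle.
class OTelGaugeImpl
{
public:
    void set(std::uint64_t v) { value_.store(v, std::memory_order_relaxed); }
    std::uint64_t read() const { return value_.load(std::memory_order_relaxed); }

private:
    std::atomic<std::uint64_t> value_{0};
};

// Minimal collection loop mimicking the periodic reader's export pass.
class GaugeRegistry
{
public:
    void registerGauge(std::string name, std::function<std::uint64_t()> cb)
    {
        callbacks_.emplace_back(std::move(name), std::move(cb));
    }

    // Invoke every registered callback and gather (name, value) pairs.
    std::vector<std::pair<std::string, std::uint64_t>> collect() const
    {
        std::vector<std::pair<std::string, std::uint64_t>> out;
        for (auto const& [name, cb] : callbacks_)
            out.emplace_back(name, cb());
        return out;
    }

private:
    std::vector<std::pair<std::string, std::function<std::uint64_t()>>> callbacks_;
};
```

The same decoupling keeps instrumented code paths free of exporter work: writers only touch an atomic, and all aggregation happens on the reader thread.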
Task 7.3: Update CollectorManager
Objective: Add a `server=otel` config option to route metric creation to the new OTel backend.
What to do:
- Edit `src/xrpld/app/main/CollectorManager.cpp`:
  - In the constructor, add a third branch after `server == "statsd"`:

    ```cpp
    else if (server == "otel")
    {
        // Read endpoint from [telemetry] section
        auto const endpoint = get(
            telemetryParams, "endpoint", "http://localhost:4318/v1/metrics");
        std::string const& prefix(get(params, "prefix"));
        m_collector =
            beast::insight::OTelCollector::New(endpoint, prefix, journal);
    }
    ```

  - This requires access to the `[telemetry]` config section — it may need to be passed as a parameter or read from the Application config.
- Edit `src/xrpld/app/main/CollectorManager.h`:
  - Add `#include <xrpl/beast/insight/OTelCollector.h>`

Key modified files:
- `src/xrpld/app/main/CollectorManager.cpp`
- `src/xrpld/app/main/CollectorManager.h`
Task 7.4: Update OTel Collector Configuration
Objective: Add a metrics pipeline to the OTLP receiver and remove the StatsD receiver dependency.
What to do:
- Edit `docker/telemetry/otel-collector-config.yaml`:
  - Remove the `statsd` receiver (no longer needed when `server=otel`)
  - Add a metrics pipeline under `service.pipelines`:

    ```yaml
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]
    ```

  - The OTLP receiver already listens on :4318 — it just needs to be added to the metrics pipeline receivers.
  - Keep the `spanmetrics` connector in the metrics pipeline so span-derived RED metrics continue working.
- Edit `docker/telemetry/docker-compose.yml`:
  - Remove the UDP :8125 port mapping from the otel-collector service
  - Update the rippled service config: change `[insight] server=statsd` to `server=otel`

Key modified files:
- `docker/telemetry/otel-collector-config.yaml`
- `docker/telemetry/docker-compose.yml`
Note: Keep a commented-out statsd receiver block for operators who need backward compatibility.
Task 7.5: Preserve Metric Names in Prometheus
Objective: Ensure existing Grafana dashboards continue working with identical metric names.
What to do:
- In `OTelCollector.cpp`, construct OTel instrument names to match the existing Prometheus metric names:
  - beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name `rippled_LedgerMaster_Validated_Ledger_Age`
  - The prefix + group + name concatenation must produce the same string as `StatsDCollector`'s format
  - Use underscores as separators (matching the StatsD convention)
- Verify in an integration test that key Prometheus queries still return data:
  - `rippled_LedgerMaster_Validated_Ledger_Age`
  - `rippled_Peer_Finder_Active_Inbound_Peers`
  - `rippled_rpc_requests`
Key consideration: The OTel Prometheus exporter may normalize metric names differently than the StatsD receiver did. Test this early (during Task 7.2) and adjust the naming strategy if needed. The OTel SDK's Prometheus exporter adds a `_total` suffix to counters and converts dots to underscores — match existing conventions.
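A sketch of the name construction, assuming the parts are joined with underscores and any dots are normalized away (the exact rules must be verified against `StatsDCollector`'s actual output; `makeInstrumentName` is a hypothetical helper name):

```cpp
#include <algorithm>
#include <string>

// Build the exported instrument name from the [insight] prefix plus the
// beast::insight group and metric name, normalizing dots to underscores
// so the Prometheus-side name matches the StatsD-era name exactly.
std::string
makeInstrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string full = prefix + "_" + group + "_" + name;
    std::replace(full.begin(), full.end(), '.', '_');
    return full;
}
```

A unit test asserting `makeInstrumentName("rippled", "LedgerMaster", "Validated_Ledger_Age")` equals the dashboard query string would catch naming drift early.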
Task 7.6: Update Grafana Dashboards
Objective: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.
What to do:
- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes are needed.
- If OTLP export produces different names (e.g., a `_total` suffix on counters), update:
  - `docker/telemetry/grafana/dashboards/statsd-node-health.json`
  - `docker/telemetry/grafana/dashboards/statsd-network-traffic.json`
  - `docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json`
- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).

Key modified files:
- `docker/telemetry/grafana/dashboards/statsd-*.json` (3 files, conditionally)
Task 7.7: Update Integration Tests
Objective: Verify the full OTLP metrics pipeline end-to-end.
What to do:
- Edit `docker/telemetry/integration-test.sh`:
  - Update the test config to use `[insight] server=otel`
  - Verify metrics arrive in Prometheus via OTLP (not StatsD)
  - Add a check that the StatsD receiver is no longer required
  - Preserve all existing metric presence checks

Key modified files:
- `docker/telemetry/integration-test.sh`
Task 7.8: Update Documentation
Objective: Update all plan docs, runbook, and reference docs to reflect the migration.
What to do:
- Edit `docs/telemetry-runbook.md`:
  - Update `[insight]` config examples to show `server=otel`
  - Update the troubleshooting section (no more StatsD UDP debugging)
- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Update the Data Flow Overview diagram (remove the StatsD receiver)
  - Update the Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
  - Update config examples
- Edit `OpenTelemetryPlan/05-configuration-reference.md`:
  - Add the `server=otel` option to the `[insight]` section docs
- Edit `docker/telemetry/TESTING.md`:
  - Update setup instructions to use `server=otel`

Key modified files:
- `docs/telemetry-runbook.md`
- `OpenTelemetryPlan/09-data-collection-reference.md`
- `OpenTelemetryPlan/05-configuration-reference.md`
- `docker/telemetry/TESTING.md`
Task 7.9: ValidationTracker — Validation Agreement Computation
Source: External Dashboard Parity — the most valuable metric from the community xrpl-validator-dashboard.
Upstream: Phase 4 Task 4.8 (validation span attributes provide ledger hash context). Downstream: Phase 9 (Validator Health dashboard), Phase 10 (validation checks), Phase 11 (agreement alert rules).
Objective: Implement a stateful class that tracks whether our validator's validations agree with network consensus, maintaining rolling 1h and 24h windows with an 8-second grace period and 5-minute late repair window.
Architecture:
```
consensus.validation.send ────> ValidationTracker ────> MetricsRegistry
 (records our validation        (reconciles after       (exports agreement
  for ledger X)                  8s grace period)        gauges every 10s)

ledger.validate ──────────────> ValidationTracker
 (records which ledger          (marks ledger X as
  network validated)             agreed or missed)
```
What to do:
- Create `src/xrpld/telemetry/ValidationTracker.h`:
  - `recordOurValidation(ledgerHash, ledgerSeq)` — called when we send a validation
  - `recordNetworkValidation(ledgerHash, seq)` — called when a ledger is fully validated
  - `reconcile()` — called periodically; reconciles pending ledger events after the 8s grace period
  - Getters: `agreementPct1h()`, `agreementPct24h()`, `agreements1h()`, `missed1h()`, `agreements24h()`, `missed24h()`, `totalAgreements()`, `totalMissed()`, `totalValidationsSent()`, `totalValidationsChecked()`
  - Thread-safety: atomics for counters, mutex for window deques
- Create `src/xrpld/telemetry/detail/ValidationTracker.cpp`:
  - Reconciliation logic: after the 8s grace period, `weValidated && networkValidated && sameHash` → agreement; otherwise missed
  - Late repair: if a late validation arrives within 5 minutes, correct a false-positive miss
  - Sliding window: `std::deque<WindowEvent>` evicts entries older than 1h/24h on each reconciliation pass
  - Ring buffer of 1000 `LedgerEvent` structs for pending reconciliation
- Add recording hooks (modifying Phase 4 code from the Phase 7 branch):
  - `RCLConsensus.cpp` `validate()`: call `tracker.recordOurValidation()`
  - `LedgerMaster.cpp` fully-validated path: call `tracker.recordNetworkValidation()`
Key data structures:
```cpp
struct LedgerEvent
{
    uint256 ledgerHash;
    LedgerIndex seq;
    TimePoint closeTime;
    bool weValidated = false;
    bool networkValidated = false;
    bool reconciled = false;
    bool agreed = false;
};

struct WindowEvent
{
    TimePoint time;
    bool agreed;
};
```
Key new files:
- `src/xrpld/telemetry/ValidationTracker.h`
- `src/xrpld/telemetry/detail/ValidationTracker.cpp`

Key modified files:
- `src/xrpld/telemetry/MetricsRegistry.h` (add ValidationTracker member)
- `src/xrpld/telemetry/MetricsRegistry.cpp` (add gauge callback reading from tracker)
- `src/xrpld/app/consensus/RCLConsensus.cpp` (add recording hooks)
- `src/xrpld/app/ledger/detail/LedgerMaster.cpp` (add recording hook)
Exit Criteria:
- ValidationTracker correctly tracks agreement with 8s grace period
- 5-minute late repair corrects false-positive misses
- Thread-safe (atomics + mutex for window deques)
- Rolling windows correctly evict stale entries
- Unit tests: normal agreement, missed validation, late repair, window eviction
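The core reconciliation decision can be sketched as a pure function over a simplified event (hashes reduced to `std::string` for illustration; the real code uses `uint256`/`LedgerIndex`, runs under the tracker's mutex, and adds the 5-minute late-repair pass, which is omitted here):

```cpp
#include <chrono>
#include <optional>
#include <string>

using Clock = std::chrono::steady_clock;

// Simplified LedgerEvent for the sketch.
struct LedgerEvent
{
    std::string ourHash;      // hash we validated (empty if we did not)
    std::string networkHash;  // hash the network validated (empty if not yet)
    Clock::time_point closeTime;
    bool reconciled = false;
};

// After the 8s grace period: agreement iff we validated, the network
// validated, and both saw the same hash. Returns nullopt while the event
// is still inside the grace period or has already been reconciled.
std::optional<bool>
reconcile(LedgerEvent& ev, Clock::time_point now)
{
    using namespace std::chrono_literals;
    if (ev.reconciled || now - ev.closeTime < 8s)
        return std::nullopt;
    ev.reconciled = true;
    bool const agreed = !ev.ourHash.empty() && !ev.networkHash.empty() &&
        ev.ourHash == ev.networkHash;
    return agreed;
}
```

Keeping the decision pure makes the four unit-test cases above (agreement, miss, grace period, repair) straightforward table-driven tests.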
Task 7.10: Validator Health Observable Gauges
Source: External Dashboard Parity
Objective: Export amendment blocked, UNL health, and quorum data as a native OTel observable gauge.
What to do:
- In `MetricsRegistry.cpp` `registerAsyncGauges()`, add:

  ```cpp
  validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
      "rippled_validator_health", "Validator health indicators");
  ```
Gauge label values:

| Label `metric=` | Type | Source |
|---|---|---|
| `amendment_blocked` | int64 | `app_.getOPs().isAmendmentBlocked()` → 0/1 |
| `unl_blocked` | int64 | `app_.getOPs().isUNLBlocked()` → 0/1 |
| `unl_expiry_days` | double | `app_.validators().expires()` → days until expiry |
| `validation_quorum` | int64 | `app_.validators().quorum()` |
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp
Exit Criteria:
- All 4 label values emitted every 10s
- `unl_expiry_days` is negative when expired, positive when active
- Values visible in Prometheus
Task 7.11: Peer Quality Observable Gauges
Source: External Dashboard Parity
Objective: Export peer health aggregates (latency P90, insane peers, version awareness) as a native OTel observable gauge.
What to do:
- In `MetricsRegistry.cpp` `registerAsyncGauges()`, add a callback that iterates `app_.overlay().foreach(...)` to:
  - Collect per-peer latency values, sort them, and compute the P90
  - Count peers with `tracking_ == diverged` (insane)
  - Compare each peer's `getVersion()` to our own version for upgrade awareness
Gauge label values:

| Label `metric=` | Type | Source |
|---|---|---|
| `peer_latency_p90_ms` | double | P90 from sorted peer latencies |
| `peers_insane_count` | int64 | Peers with diverged tracking status |
| `peers_higher_version_pct` | double | % of peers on a newer rippled version |
| `upgrade_recommended` | int64 | 1 if `peers_higher_version_pct` > 60% |
Implementation note: The callback runs every 10s on the metrics reader thread. Iterating ~50-200 peers is acceptable overhead.
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp
Exit Criteria:
- P90 latency computed correctly
- Insane count matches the `peers` RPC output
- Version comparison handles format variations (e.g., "rippled-2.4.0-rc1")
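One way to compute the P90, using the nearest-rank method over the collected latencies (an assumption; any consistent percentile definition works as long as dashboards interpret it the same way):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank P90: sort the samples and take the value at
// rank ceil(0.9 * n). Returns 0 when no peers reported a latency.
double
latencyP90(std::vector<double> latencies)
{
    if (latencies.empty())
        return 0.0;
    std::sort(latencies.begin(), latencies.end());
    auto const rank = static_cast<std::size_t>(
        std::ceil(0.9 * static_cast<double>(latencies.size())));
    return latencies[rank - 1];
}
```

Taking the vector by value keeps the callback side-effect free; at 50-200 peers per 10s collection pass the copy and sort are negligible.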
Task 7.12: Ledger Economy Observable Gauges
Source: External Dashboard Parity
Objective: Export fee, reserve, ledger age, and transaction rate as a native OTel observable gauge.
Gauge label values:

| Label `metric=` | Type | Source |
|---|---|---|
| `base_fee_xrp` | double | Base fee from validated ledger fee settings (drops → XRP) |
| `reserve_base_xrp` | double | Account reserve from validated ledger (drops → XRP) |
| `reserve_inc_xrp` | double | Owner reserve increment (drops → XRP) |
| `ledger_age_seconds` | double | now − lastValidatedCloseTime |
| `transaction_rate` | double | Derived: tx count delta / time delta (smoothed) |
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp
Exit Criteria:
- Fee values match `server_info` RPC output
- `ledger_age_seconds` increases monotonically between ledger closes
- `transaction_rate` is smoothed (rolling average)
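The smoothed `transaction_rate` could be maintained as an exponential moving average over per-interval rates; the class name and the 0.3 default smoothing factor here are illustrative choices, not from the spec:

```cpp
// Exponential moving average of transactions/second across collection
// intervals. Each sample is (tx count delta) / (time delta); the EMA
// damps ledger-to-ledger burstiness.
class TxRateSmoother
{
public:
    explicit TxRateSmoother(double alpha = 0.3) : alpha_(alpha) {}

    // Feed one interval's deltas; returns the updated smoothed rate.
    double update(double txDelta, double secondsDelta)
    {
        if (secondsDelta <= 0.0)
            return rate_;  // ignore degenerate intervals
        double const instant = txDelta / secondsDelta;
        rate_ = initialized_ ? alpha_ * instant + (1.0 - alpha_) * rate_
                             : instant;  // first sample seeds the EMA
        initialized_ = true;
        return rate_;
    }

    double rate() const { return rate_; }

private:
    double alpha_;
    double rate_ = 0.0;
    bool initialized_ = false;
};
```

Seeding with the first sample (rather than 0) avoids a long warm-up ramp after restart.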
Task 7.13: State Tracking Observable Gauges
Source: External Dashboard Parity
Objective: Export extended state value (0-6 encoding combining OperatingMode + ConsensusMode) and time-in-current-state.
Gauge label values:

| Label `metric=` | Type | Source |
|---|---|---|
| `state_value` | int64 | 0-6 encoding (see spec for mapping) |
| `time_in_current_state_seconds` | double | now − lastModeChangeTime from StateAccounting |
State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (full + validating), 6=proposing (full + proposing).
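The encoding above can be sketched as a pure function; the enum names here are stand-ins for rippled's actual `OperatingMode` and `ConsensusMode` values, and the rule that the consensus role only refines the value once the node is FULL is an assumption consistent with the 0-6 mapping:

```cpp
#include <cstdint>

// Stand-in enums mirroring the 0-6 mapping in the spec.
enum class OperatingMode { Disconnected, Connected, Syncing, Tracking, Full };
enum class ConsensusRole { Observing, Validating, Proposing };

// 0=disconnected .. 4=full; 5 = full + validating; 6 = full + proposing.
std::int64_t
encodeStateValue(OperatingMode om, ConsensusRole role)
{
    if (om != OperatingMode::Full)
        return static_cast<std::int64_t>(om);
    switch (role)
    {
        case ConsensusRole::Validating: return 5;
        case ConsensusRole::Proposing:  return 6;
        default:                        return 4;
    }
}
```

A single monotone scale like this lets dashboards threshold on "≥ 4" for healthy and plot mode transitions as step changes.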
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp
Exit Criteria:
- `state_value` correctly combines OperatingMode and ConsensusMode
- `time_in_current_state_seconds` resets on mode change
Task 7.14: Storage Detail and Sync Info Gauges
Source: External Dashboard Parity
Objective: Export NuDB-specific storage size and initial sync duration.
Gauge label values:

| Gauge Name | Label `metric=` | Type | Source |
|---|---|---|---|
| `rippled_storage_detail` | `nudb_bytes` | int64 | NuDB backend file size |
| `rippled_sync_info` | `initial_sync_duration_seconds` | double | Time from start to first FULL |
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp
Exit Criteria:
- NuDB file size reported in bytes (0 if NuDB not configured)
- Sync duration captured once and remains stable after reaching FULL
Task 7.15: New Synchronous Counters
Source: External Dashboard Parity
Objective: Add 7 new event counters incremented at their respective instrumentation sites.
| Counter Name | Increment Site | Source File |
|---|---|---|
| `rippled_ledgers_closed_total` | `onAccept()` in consensus | RCLConsensus.cpp |
| `rippled_validations_sent_total` | `validate()` in consensus | RCLConsensus.cpp |
| `rippled_validations_checked_total` | Network validation received | LedgerMaster.cpp |
| `rippled_validation_agreements_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `rippled_validation_missed_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `rippled_state_changes_total` | `setMode()` in NetworkOPs | NetworkOPs.cpp |
| `rippled_jq_trans_overflow_total` | Job queue overflow path | JobQueue.cpp |
Key modified files: src/xrpld/telemetry/MetricsRegistry.h/.cpp (declarations), plus recording sites in RCLConsensus.cpp, LedgerMaster.cpp, NetworkOPs.cpp, JobQueue.cpp
Exit Criteria:
- All 7 counters monotonically increase during normal operation
- Counter values match expected rates (e.g., ledgers_closed ≈ 1 per 3-5s)
Task 7.16: Validation Agreement Observable Gauge
Source: External Dashboard Parity
Objective: Export rolling window agreement stats from ValidationTracker (Task 7.9).
Gauge label values:

| Gauge Name | Label `metric=` | Type | Source |
|---|---|---|---|
| `rippled_validation_agreement` | `agreement_pct_1h` | double | `tracker.agreementPct1h()` |
| | `agreements_1h` | int64 | `tracker.agreements1h()` |
| | `missed_1h` | int64 | `tracker.missed1h()` |
| | `agreement_pct_24h` | double | `tracker.agreementPct24h()` |
| | `agreements_24h` | int64 | `tracker.agreements24h()` |
| | `missed_24h` | int64 | `tracker.missed24h()` |
Key modified files: src/xrpld/telemetry/MetricsRegistry.cpp
Exit Criteria:
- Agreement percentages in range [0.0, 100.0]
- Window stats stabilize after 1h/24h of operation
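The exported percentage is presumably just agreements over total reconciled events, guarded against an empty window; a sketch (whether an empty window should report 100 or 0 is a design choice to confirm against the spec):

```cpp
#include <cstdint>

// Agreement percentage in [0.0, 100.0]. An empty window reports 100%
// here (no evidence of disagreement); reporting 0 instead would make
// a freshly started validator look unhealthy.
double
agreementPct(std::uint64_t agreements, std::uint64_t missed)
{
    auto const total = agreements + missed;
    if (total == 0)
        return 100.0;
    return 100.0 * static_cast<double>(agreements) /
        static_cast<double>(total);
}
```

This keeps the exit criterion's [0.0, 100.0] range invariant by construction.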
Summary Table
| Task | Description | New Files | Modified Files | Depends On |
|---|---|---|---|---|
| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | — |
| 7.2 | Implement OTelCollector class | 2 | 0 | 7.1 |
| 7.3 | Update CollectorManager config routing | 0 | 2 | 7.2 |
| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 7.3 |
| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 7.2 |
| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 7.5 |
| 7.7 | Update integration tests | 0 | 1 | 7.4 |
| 7.8 | Update documentation | 0 | 4 | 7.6 |
| 7.9 | ValidationTracker (agreement tracking) | 2 | 4 | 7.2, P4.8 |
| 7.10 | Validator health observable gauges | 0 | 2 | 7.2 |
| 7.11 | Peer quality observable gauges | 0 | 2 | 7.2 |
| 7.12 | Ledger economy observable gauges | 0 | 2 | 7.2 |
| 7.13 | State tracking observable gauges | 0 | 2 | 7.2 |
| 7.14 | Storage detail and sync info gauges | 0 | 2 | 7.2 |
| 7.15 | New synchronous counters | 0 | 6 | 7.2 |
| 7.16 | Validation agreement observable gauge | 0 | 1 | 7.9 |
Parallel work: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6. Tasks 7.10-7.14 can all run in parallel after 7.2. Task 7.15 depends on 7.2. Task 7.16 depends on 7.9. Task 7.9 depends on 7.2 and Phase 4 Task 4.8.
Exit Criteria (from 06-implementation-phases.md §6.8):
- All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- `server=otel` is the default in the development docker-compose
- `server=statsd` still works as a fallback
- Existing Grafana dashboards display data correctly
- Integration test passes with OTLP-only metrics pipeline
- No performance regression vs StatsD baseline (< 1% CPU overhead)
- Deferred Task 6.1 (`|m` wire format) no longer relevant — Meter is mapped to an OTel Counter
- ValidationTracker agreement % stabilizes after 1h under normal consensus
- All new gauges and counters visible in Prometheus with non-zero values