diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index fc59db1024..b4221857ab 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -337,7 +337,7 @@ prefix=xrpld | `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 10–21 | | `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth | | `xrpld_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) | -| `xrpld_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) | +| `xrpld_jobq_job_count` | JobQueue.cpp | Current job queue depth (group `jobq`) | 0–100 (healthy) | **Grafana dashboard**: _Node Health (System Metrics)_ (`xrpld-system-node-health`) @@ -592,90 +592,22 @@ count_over_time({job="xrpld"} |= "trace_id=" [5m]) --- -## 5b. Future: Internal Metric Gap Fill (Phase 9) +## 5b. Internal Metric Gap Fill (Phase 9) -> **Status**: Planned, not yet implemented. +> **Status**: Implemented. > **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context > **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details -Phase 9 fills ~50+ metrics that exist inside xrpld but currently lack time-series export. Uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, OTel `ObservableGauge` async callbacks for new categories. +Phase 9 fills the metrics that exist inside xrpld but previously lacked time-series export. It +uses a hybrid approach: `beast::insight` extensions for NodeStore I/O plus OTel `ObservableGauge` +async callbacks for new categories. 
-### New Metric Categories - -#### NodeStore I/O (via beast::insight) - -| Prometheus Metric | Type | Description | -| ---------------------------------- | ----- | ----------------------------------- | -| `xrpld_nodestore_reads_total` | Gauge | Cumulative read operations | -| `xrpld_nodestore_reads_hit` | Gauge | Cache-served reads | -| `xrpld_nodestore_writes` | Gauge | Cumulative write operations | -| `xrpld_nodestore_written_bytes` | Gauge | Cumulative bytes written | -| `xrpld_nodestore_read_bytes` | Gauge | Cumulative bytes read | -| `xrpld_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) | -| `xrpld_nodestore_write_load` | Gauge | Current write load score | -| `xrpld_nodestore_read_queue` | Gauge | Items in read queue | - -#### Cache Hit Rates (via OTel MetricsRegistry) - -| Prometheus Metric | Type | Description | -| ----------------------------- | ----- | ------------------------------------ | -| `xrpld_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) | -| `xrpld_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate | -| `xrpld_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate | -| `xrpld_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) | -| `xrpld_cache_fullbelow_size` | Gauge | FullBelow cache size | - -#### Transaction Queue (via OTel MetricsRegistry) - -| Prometheus Metric | Type | Description | -| ------------------------------------ | ----- | -------------------------------- | -| `xrpld_txq_count` | Gauge | Current transactions in queue | -| `xrpld_txq_max_size` | Gauge | Maximum queue capacity | -| `xrpld_txq_in_ledger` | Gauge | Transactions in open ledger | -| `xrpld_txq_per_ledger` | Gauge | Expected transactions per ledger | -| `xrpld_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level | -| `xrpld_txq_med_fee_level` | Gauge | Median fee level in queue | -| `xrpld_txq_reference_fee_level` | Gauge | Reference fee level | -| `xrpld_txq_min_processing_fee_level` 
| Gauge | Minimum fee to get processed | - -#### PerfLog Per-RPC Method (via OTel Metrics SDK) - -| Prometheus Metric | Type | Labels | Description | -| ------------------------------------- | --------- | ----------------- | --------------------------- | -| `xrpld_rpc_method_started_total` | Counter | `method=""` | RPC calls started | -| `xrpld_rpc_method_finished_total` | Counter | `method=""` | RPC calls completed | -| `xrpld_rpc_method_errored_total` | Counter | `method=""` | RPC calls errored | -| `xrpld_rpc_method_duration_us_bucket` | Histogram | `method=""` | Execution time distribution | - -#### PerfLog Per-Job Type (via OTel Metrics SDK) - -| Prometheus Metric | Type | Labels | Description | -| -------------------------------------- | --------- | ------------------- | --------------- | -| `xrpld_job_queued_total` | Counter | `job_type=""` | Jobs queued | -| `xrpld_job_started_total` | Counter | `job_type=""` | Jobs started | -| `xrpld_job_finished_total` | Counter | `job_type=""` | Jobs completed | -| `xrpld_job_queued_duration_us_bucket` | Histogram | `job_type=""` | Queue wait time | -| `xrpld_job_running_duration_us_bucket` | Histogram | `job_type=""` | Execution time | - -#### Counted Object Instances (via OTel MetricsRegistry) - -| Prometheus Metric | Type | Labels | Description | -| -------------------- | ----- | --------------- | ------------------------------- | -| `xrpld_object_count` | Gauge | `type=""` | Live instances of internal type | - -Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry` - -#### Fee Escalation & Load Factors (via OTel MetricsRegistry) - -| Prometheus Metric | Type | Description | -| ---------------------------------- | ----- | ------------------------------------ | -| `xrpld_load_factor` | Gauge | Combined transaction cost multiplier | -| `xrpld_load_factor_server` | Gauge | Server + cluster + network load | -| 
`xrpld_load_factor_local` | Gauge | Local server load only | -| `xrpld_load_factor_net` | Gauge | Network-wide load estimate | -| `xrpld_load_factor_cluster` | Gauge | Cluster peer load | -| `xrpld_load_factor_fee_escalation` | Gauge | Open ledger fee escalation | -| `xrpld_load_factor_fee_queue` | Gauge | Queue entry fee level | +> **Authoritative metric names live in [§ Phase 9: OTel SDK-Exported Metrics](#phase-9-otel-sdk-exported-metrics-metricsregistry) below.** +> Most internal metrics are emitted as **labeled** gauges — one instrument carrying many logical +> values via a `metric` label (e.g. `xrpld_cache_metrics{metric="SLE_hit_rate"}`, +> `xrpld_txq_metrics{metric="txq_count"}`, `xrpld_load_factor_metrics{metric="load_factor"}`, +> `xrpld_nodestore_state{metric="node_reads_total"}`) — not the flat per-name form. Query the +> labeled names; the flat names (`xrpld_cache_SLE_hit_rate`, `xrpld_txq_count`, …) are **not** emitted. #### Server Info (via OTel MetricsRegistry) @@ -746,15 +678,23 @@ Phase 10 builds a 5-node validator docker-compose harness with RPC load generato ### Validated Telemetry Inventory -| Category | Expected Count | Validation Method | -| ------------------ | -------------- | -------------------------------- | -| Trace spans | 16 | Jaeger/Tempo API query | -| Span attributes | 22 | Per-span attribute assertion | -| StatsD metrics | 255+ | Prometheus query | -| Phase 9 metrics | 68+ | Prometheus query | -| SpanMetrics RED | 4 per span | Prometheus query | -| Grafana dashboards | 10 | Dashboard API "no data" check | -| Log-trace links | Present | Loki query + Tempo reverse check | +> **Counting note — families vs series.** A _metric family_ is one distinct Prometheus `__name__` +> (histogram `_bucket`/`_count`/`_sum` collapsed to one). A _series_ is a family × its label +> combinations. The legacy overlay-traffic block is the bulk of the count: ~56 message categories × +> 4 (`_Bytes_In/_Out`, `_Messages_In/_Out`) ≈ 224 families on its own. 
The labeled gauges +> (`xrpld_cache_metrics{metric}`, …) are few families but many series. Validate against the figures +> below as **families currently emitting** (idle nodes under-report — workload-gated metrics such as +> per-RPC/error counters appear only once exercised, which is Phase 10's purpose). + +| Category | Expected Count | Validation Method | +| ------------------------- | ------------------- | -------------------------------- | +| Trace spans | 16 | Jaeger/Tempo API query | +| Span attributes | 22 | Per-span attribute assertion | +| Legacy `xrpld_*` families | ~270 (≈224 traffic) | Prometheus `__name__` query | +| Native MetricsRegistry | 35 instruments | Prometheus query | +| SpanMetrics RED | 4 per span | Prometheus query | +| Grafana dashboards | 10 | Dashboard API "no data" check | +| Log-trace links | Present | Loki query + Tempo reverse check | --- @@ -998,15 +938,27 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full #### Synchronous Counters (Phase 7+) -| Prometheus Metric | Type | Description | Increment Site | -| ----------------------------------- | ------- | -------------------------------- | --------------------- | -| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp | -| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp | -| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp | -| `xrpld_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp | -| `xrpld_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp | -| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp | -| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp | +| Prometheus Metric | Type | Description | Increment Site | +| --------------------------------- | ------- | 
------------------------------- | ---------------- | +| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp | +| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp | +| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp | +| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp | +| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp | + +Lifetime validation agreement/miss tallies are exported as monotonic **ObservableCounters** +(not synchronous counters) observed from `ValidationTracker`'s gross lifetime totals: + +| Prometheus Metric | Type | Description | Source | +| ----------------------------------- | ----------------- | ------------------------------------------ | --------------------- | +| `xrpld_validation_agreements_total` | ObservableCounter | Lifetime validations that initially agreed | ValidationTracker.cpp | +| `xrpld_validation_missed_total` | ObservableCounter | Lifetime validations that initially missed | ValidationTracker.cpp | + +> **Counting semantics (initial-classification only):** each reconciled ledger increments exactly +> one of these two counters, at first classification. A later late-repair (miss → agreement) does +> **not** move either counter — keeping both strictly monotonic (a Prometheus `_total` must never +> decrease) and additive (`agreements_total + missed_total` = ledgers reconciled). The +> repair-aware, windowed view remains on `xrpld_validation_agreement{metric="…"}`. 
#### Span Attribute Enrichments (Phases 2-4) @@ -1071,7 +1023,7 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full | Issue | Impact | Status | | ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- | | `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m` → `\|c` change in StatsDCollector.cpp | -| `xrpld_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity | +| `xrpld_jobq_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity | | `xrpld_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg | | Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet | diff --git a/src/tests/libxrpl/telemetry/ValidationTracker.cpp b/src/tests/libxrpl/telemetry/ValidationTracker.cpp index 7a5179c871..d2b96aa616 100644 --- a/src/tests/libxrpl/telemetry/ValidationTracker.cpp +++ b/src/tests/libxrpl/telemetry/ValidationTracker.cpp @@ -132,6 +132,8 @@ TEST_F(ValidationTrackerTest, EmptyWindowReturnsZero) EXPECT_EQ(tracker_.missed24h(), 0u); EXPECT_EQ(tracker_.totalAgreements(), 0u); EXPECT_EQ(tracker_.totalMissed(), 0u); + EXPECT_EQ(tracker_.totalAgreementsEver(), 0u); + EXPECT_EQ(tracker_.totalMissedEver(), 0u); EXPECT_EQ(tracker_.totalValidationsSent(), 0u); EXPECT_EQ(tracker_.totalValidationsChecked(), 0u); } @@ -282,3 +284,91 @@ TEST_F(ValidationTrackerTest, OnlyWeValidated) EXPECT_EQ(tracker_.missed1h(), 1u); EXPECT_DOUBLE_EQ(tracker_.agreementPct1h(), 0.0); } + +// --------------------------------------------------------------- +// 10. 
Gross miss tally is monotonic across a late repair +// The gross lifetime tallies (totalAgreementsEver/totalMissedEver) +// back the monotonic Prometheus _total counters. A late repair must +// move the NET totals (miss -> agreement) but must NOT move the gross +// tallies: a miss already counted stays counted, and the repair does +// not add a second (agreement) count for the same ledger. +// --------------------------------------------------------------- +TEST_F(ValidationTrackerTest, GrossMissedNeverDecrementsOnRepair) +{ + auto const hash = makeHash(10); + LedgerIndex const seq = 1000; + + // Network validates, we do not (yet). + tracker_.recordNetworkValidation(hash, seq); + + // Grace period elapses -- reconciled as a miss. + std::this_thread::sleep_for(std::chrono::seconds(9)); + tracker_.reconcile(); + + // Net and gross both show exactly one initial miss, zero agreements. + EXPECT_EQ(tracker_.totalMissed(), 1u); + EXPECT_EQ(tracker_.totalMissedEver(), 1u); + EXPECT_EQ(tracker_.totalAgreements(), 0u); + EXPECT_EQ(tracker_.totalAgreementsEver(), 0u); + + // Late arrival of our validation repairs the miss to an agreement. + tracker_.recordOurValidation(hash, seq); + tracker_.reconcile(); + + // Net totals reflect the repair... + EXPECT_EQ(tracker_.totalMissed(), 0u); + EXPECT_EQ(tracker_.totalAgreements(), 1u); + // ...but the gross tallies are frozen at first classification: the miss + // stays counted and no agreement was added (repair path excluded). + EXPECT_EQ(tracker_.totalMissedEver(), 1u); + EXPECT_EQ(tracker_.totalAgreementsEver(), 0u); +} + +// --------------------------------------------------------------- +// 11. Gross tallies count initial classification only (additive) +// With a mix of initial agreements and misses the gross tallies equal +// the net totals. 
A subsequent repair shifts the net totals but leaves +// the gross tallies unchanged, and the gross sum equals the number of +// reconciled ledgers (the additive invariant the _total counters rely on). +// --------------------------------------------------------------- +TEST_F(ValidationTrackerTest, GrossAgreementsCountInitialOnly) +{ + // 3 initial agreements: both sides validate. + for (int i = 1; i <= 3; ++i) + { + auto const h = makeHash(static_cast(i)); + tracker_.recordOurValidation(h, static_cast(i)); + tracker_.recordNetworkValidation(h, static_cast(i)); + } + + // 2 initial misses: only network validates. + for (int i = 4; i <= 5; ++i) + { + auto const h = makeHash(static_cast(i)); + tracker_.recordNetworkValidation(h, static_cast(i)); + } + + // Grace period elapses -- all five reconciled at first classification. + std::this_thread::sleep_for(std::chrono::seconds(9)); + tracker_.reconcile(); + + // Before any repair, gross equals net. + EXPECT_EQ(tracker_.totalAgreements(), 3u); + EXPECT_EQ(tracker_.totalAgreementsEver(), 3u); + EXPECT_EQ(tracker_.totalMissed(), 2u); + EXPECT_EQ(tracker_.totalMissedEver(), 2u); + + // Repair one of the misses (hash 4) within the repair window. + tracker_.recordOurValidation(makeHash(4), 4); + tracker_.reconcile(); + + // Net totals shift by the repair... + EXPECT_EQ(tracker_.totalAgreements(), 4u); + EXPECT_EQ(tracker_.totalMissed(), 1u); + // ...gross tallies stay at the initial classification. + EXPECT_EQ(tracker_.totalAgreementsEver(), 3u); + EXPECT_EQ(tracker_.totalMissedEver(), 2u); + + // Additive invariant: gross agree + gross miss == ledgers reconciled. 
+ EXPECT_EQ(tracker_.totalAgreementsEver() + tracker_.totalMissedEver(), 5u); +} diff --git a/src/xrpld/telemetry/MetricsRegistry.cpp b/src/xrpld/telemetry/MetricsRegistry.cpp index 8ca0c15889..bd51db3b51 100644 --- a/src/xrpld/telemetry/MetricsRegistry.cpp +++ b/src/xrpld/telemetry/MetricsRegistry.cpp @@ -244,10 +244,9 @@ MetricsRegistry::start(std::string const& endpoint, std::string const& instanceI "xrpld_txq_expired_total", "Total transactions expired out of the transaction queue"); txqDroppedCounter_ = meter_->CreateUInt64Counter( "xrpld_txq_dropped_total", "Total transactions refused admission to the queue by reason"); - validationAgreementsCounter_ = meter_->CreateUInt64Counter( - "xrpld_validation_agreements_total", "Total validation agreements"); - validationMissedCounter_ = - meter_->CreateUInt64Counter("xrpld_validation_missed_total", "Total validation misses"); + // Note: xrpld_validation_agreements_total / xrpld_validation_missed_total + // are monotonic ObservableCounters created in registerValidationTotalsCounters() + // (below), observed from ValidationTracker's gross lifetime tallies. // Register all observable (async) gauges. registerAsyncGauges(); @@ -441,6 +440,7 @@ MetricsRegistry::registerAsyncGauges() registerStateTrackingGauge(); registerStorageDetailGauge(); registerValidationAgreementGauge(); + registerValidationTotalsCounters(); } void @@ -1325,13 +1325,67 @@ MetricsRegistry::registerValidationAgreementGauge() } }, this); +} - // Note: validationAgreementsCounter_ and validationMissedCounter_ are - // created above but not currently incremented. The - // xrpld_validation_agreement gauge already provides agreement and miss - // counts from ValidationTracker's rolling windows and lifetime totals. - // These counters are reserved for future use if a push-style counter - // integration with ValidationTracker is desired. +void +MetricsRegistry::registerValidationTotalsCounters() +{ + // Lifetime validation agreement/miss counters. 
+    //
+    // These are monotonic ObservableCounters (not the sync Counters they used
+    // to be): a Prometheus _total must never decrease, but ValidationTracker's
+    // NET totals are non-monotonic (a late repair decrements the net miss
+    // count). We therefore observe the tracker's GROSS lifetime tallies, which
+    // count each ledger once at first classification and are never adjusted on
+    // repair (initial-classification semantics — see ValidationTracker). The
+    // repaired/agreement view remains available from xrpld_validation_agreement.
+    //
+    // reconcile() is called first so pending events are resolved before the
+    // tallies are read; the callback fires every ~10 s from the
+    // PeriodicExportingMetricReader thread.
+    validationAgreementsObservable_ = meter_->CreateInt64ObservableCounter(
+        "xrpld_validation_agreements_total",
+        "Lifetime validations that initially agreed with network consensus");
+    validationAgreementsObservable_->AddCallback(
+        [](opentelemetry::metrics::ObserverResult result, void* state) {
+            auto* self = static_cast<MetricsRegistry*>(state);
+            if (self->callbacksDetached_.load(std::memory_order_acquire))
+                return;
+            try
+            {
+                self->validationTracker_.reconcile();
+                opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
+                    ->Observe(static_cast<int64_t>(self->validationTracker_.totalAgreementsEver()));
+            }
+            catch (...)  // NOLINT(bugprone-empty-catch)
+            {
+                // Silently skip on error.
+            }
+        },
+        this);
+
+    validationMissedObservable_ = meter_->CreateInt64ObservableCounter(
+        "xrpld_validation_missed_total",
+        "Lifetime validations that initially missed network consensus");
+    validationMissedObservable_->AddCallback(
+        [](opentelemetry::metrics::ObserverResult result, void* state) {
+            auto* self = static_cast<MetricsRegistry*>(state);
+            if (self->callbacksDetached_.load(std::memory_order_acquire))
+                return;
+            try
+            {
+                self->validationTracker_.reconcile();
+                opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
+                    ->Observe(static_cast<int64_t>(self->validationTracker_.totalMissedEver()));
+            }
+            catch (...)
// NOLINT(bugprone-empty-catch)
+            {
+                // Silently skip on error.
+            }
+        },
+        this);
 }
 
 #endif  // XRPL_ENABLE_TELEMETRY
diff --git a/src/xrpld/telemetry/MetricsRegistry.h b/src/xrpld/telemetry/MetricsRegistry.h
index 63a240ef75..f0986b0b33 100644
--- a/src/xrpld/telemetry/MetricsRegistry.h
+++ b/src/xrpld/telemetry/MetricsRegistry.h
@@ -529,13 +529,16 @@ private:
     /// Counter: xrpld_txq_dropped_total{reason} — incremented when a transaction is refused
     /// admission to the queue.
     opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>> txqDroppedCounter_;
-    /// Counter: xrpld_validation_agreements_total — incremented by ValidationTracker on
-    /// agreement.
-    opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
-        validationAgreementsCounter_;
-    /// Counter: xrpld_validation_missed_total — incremented by ValidationTracker on miss.
-    opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
-        validationMissedCounter_;
+    /// ObservableCounter: xrpld_validation_agreements_total — observed from
+    /// ValidationTracker::totalAgreementsEver() (monotonic gross lifetime
+    /// tally, initial-classification semantics).
+    opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
+        validationAgreementsObservable_;
+    /// ObservableCounter: xrpld_validation_missed_total — observed from
+    /// ValidationTracker::totalMissedEver() (monotonic gross lifetime tally,
+    /// initial-classification semantics).
+    opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
+        validationMissedObservable_;
 
     /** Register all observable gauge callbacks with the OTel SDK.
Dispatches to one helper per metric domain so that each helper @@ -580,6 +583,8 @@ private: registerStorageDetailGauge(); // Task 7.13 void registerValidationAgreementGauge(); // Task 7.15 + void + registerValidationTotalsCounters(); // gap-fill: lifetime agree/miss _total #endif // XRPL_ENABLE_TELEMETRY }; diff --git a/src/xrpld/telemetry/ValidationTracker.h b/src/xrpld/telemetry/ValidationTracker.h index dac2f9c706..301ad31fe0 100644 --- a/src/xrpld/telemetry/ValidationTracker.h +++ b/src/xrpld/telemetry/ValidationTracker.h @@ -186,6 +186,26 @@ public: uint64_t totalMissed() const; + /** Lifetime agreements counted at first classification only. + * @note Unlike totalAgreements(), this is strictly monotonic: it is + * incremented only when a ledger is first reconciled as an agreement and + * is never adjusted by a late repair. It backs the monotonic Prometheus + * counter xrpld_validation_agreements_total. See the counting-semantics + * note in detail/ValidationTracker.cpp. + */ + uint64_t + totalAgreementsEver() const; + + /** Lifetime misses counted at first classification only. + * @note Unlike totalMissed(), this is strictly monotonic: it is + * incremented only when a ledger is first reconciled as a miss and is + * never decremented by a late repair. It backs the monotonic Prometheus + * counter xrpld_validation_missed_total. See the counting-semantics note + * in detail/ValidationTracker.cpp. + */ + uint64_t + totalMissedEver() const; + /** Total validations this node sent. */ uint64_t totalValidationsSent() const; @@ -254,12 +274,33 @@ private: /// Sliding window of reconciled events (last 7 days). std::deque window7d_; - /// Lifetime count of agreements. + /// Lifetime count of agreements (net: incremented on agree, also on + /// repair). May be read via totalAgreements(); feeds the windowed gauge. std::atomic totalAgreements_{0}; - /// Lifetime count of misses. + /// Lifetime count of misses (net: incremented on miss, decremented on + /// repair). 
NON-monotonic. May be read via totalMissed().
     std::atomic<uint64_t> totalMissed_{0};
 
+    // Monotonic "gross" lifetime tallies for the Prometheus _total counters.
+    //
+    // Counting decision (initial-classification only): each reconciled ledger
+    // is counted exactly once, at its first classification, into exactly one
+    // of the two tallies below. A later late-repair (miss -> agreement) does
+    // NOT move either tally. This keeps both strictly monotonic (a Prometheus
+    // _total must never decrease) and additive:
+    //   totalAgreementsGross_ + totalMissedGross_ == ledgers reconciled.
+    // The repaired/agreement view is still available from the windowed gauge
+    // (xrpld_validation_agreement) and the net totals above.
+
+    /// Monotonic lifetime initial agreements; backs
+    /// xrpld_validation_agreements_total. Never adjusted on repair.
+    std::atomic<uint64_t> totalAgreementsGross_{0};
+
+    /// Monotonic lifetime initial misses; backs xrpld_validation_missed_total.
+    /// Never decremented on repair.
+    std::atomic<uint64_t> totalMissedGross_{0};
+
     /// Lifetime count of validations this node sent.
     std::atomic<uint64_t> totalValidationsSent_{0};
 
diff --git a/src/xrpld/telemetry/detail/ValidationTracker.cpp b/src/xrpld/telemetry/detail/ValidationTracker.cpp
index 38e065d8b5..a3124100d0 100644
--- a/src/xrpld/telemetry/detail/ValidationTracker.cpp
+++ b/src/xrpld/telemetry/detail/ValidationTracker.cpp
@@ -63,10 +63,16 @@ ValidationTracker::reconcile()
             if (evt.agreed)
             {
                 totalAgreements_.fetch_add(1, std::memory_order_relaxed);
+                // Gross tally: count the initial agreement once. See the
+                // counting-decision note below (repair branch).
+                totalAgreementsGross_.fetch_add(1, std::memory_order_relaxed);
             }
             else
             {
                 totalMissed_.fetch_add(1, std::memory_order_relaxed);
+                // Gross tally: count the initial miss once. See the
+                // counting-decision note below (repair branch).
+ totalMissedGross_.fetch_add(1, std::memory_order_relaxed); } WindowEvent const we{.time = now, .ledgerHash = evt.ledgerHash, .agreed = evt.agreed}; @@ -78,11 +84,20 @@ ValidationTracker::reconcile() evt.reconciled && !evt.agreed && evt.weValidated && evt.networkValidated && (now - evt.recordTime) <= kLateRepairWindow) { - // Late repair: was a miss, now both flags set. + // Late repair: was a miss, now both flags set. Adjust the NET + // totals (used by the windowed agreement gauge) so the live view + // reflects the repair. evt.agreed = true; totalMissed_.fetch_sub(1, std::memory_order_relaxed); totalAgreements_.fetch_add(1, std::memory_order_relaxed); + // Counting decision (initial-classification only): the gross + // tallies (totalAgreementsGross_ / totalMissedGross_) that back the + // monotonic Prometheus _total counters are deliberately NOT touched + // here. Each ledger is counted once, at first classification; a + // repair must not decrement missed (a _total may never decrease) + // nor add a second agreement (which would double-count the ledger). + // Flip the corresponding window entries from miss to agreement. repairWindowEntry(window1h_, evt.ledgerHash); repairWindowEntry(window24h_, evt.ledgerHash); @@ -253,6 +268,18 @@ ValidationTracker::totalMissed() const return totalMissed_.load(std::memory_order_relaxed); } +uint64_t +ValidationTracker::totalAgreementsEver() const +{ + return totalAgreementsGross_.load(std::memory_order_relaxed); +} + +uint64_t +ValidationTracker::totalMissedEver() const +{ + return totalMissedGross_.load(std::memory_order_relaxed); +} + uint64_t ValidationTracker::totalValidationsSent() const {
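
The net-versus-gross counting rule this patch introduces can be illustrated in isolation. The following is a minimal standalone sketch, not the real `ValidationTracker`: the class `TallySketch` and its methods `classify()` and `repair()` are hypothetical names invented for illustration, and the sketch omits the grace period, the sliding windows, and the late-repair window check. It assumes only what the patch comments state: first classification bumps both the net and the gross tally, while a late repair moves the net totals only.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tracker's four lifetime counters.
// Not the real ValidationTracker; names are illustrative only.
class TallySketch
{
public:
    // First classification of a ledger: count it once in the net total
    // and once in the matching gross tally.
    void
    classify(bool agreed)
    {
        if (agreed)
        {
            netAgreements_.fetch_add(1, std::memory_order_relaxed);
            grossAgreements_.fetch_add(1, std::memory_order_relaxed);
        }
        else
        {
            netMissed_.fetch_add(1, std::memory_order_relaxed);
            grossMissed_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Late repair (miss -> agreement): only the net totals move.
    // The gross tallies stay frozen, so they remain monotonic.
    void
    repair()
    {
        netMissed_.fetch_sub(1, std::memory_order_relaxed);
        netAgreements_.fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t
    netAgreements() const
    {
        return netAgreements_.load(std::memory_order_relaxed);
    }

    std::uint64_t
    netMissed() const
    {
        return netMissed_.load(std::memory_order_relaxed);
    }

    std::uint64_t
    grossAgreements() const
    {
        return grossAgreements_.load(std::memory_order_relaxed);
    }

    std::uint64_t
    grossMissed() const
    {
        return grossMissed_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> netAgreements_{0};
    std::atomic<std::uint64_t> netMissed_{0};
    std::atomic<std::uint64_t> grossAgreements_{0};
    std::atomic<std::uint64_t> grossMissed_{0};
};
```

Mirroring the `GrossAgreementsCountInitialOnly` test, three initial agreements, two initial misses, and one repair would leave the net totals at 4 and 1 while the gross tallies stay at 3 and 2, so their sum still equals the five ledgers reconciled. Keeping two pairs of counters is what lets the exported `_total` series stay monotonic while the windowed gauge continues to reflect repairs.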