feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters

Wire the two previously-registered-but-never-incremented validation
counters to ValidationTracker's gross lifetime tallies, exported as
monotonic ObservableCounters. New gross atomics count each ledger once at
first classification and are never adjusted on late repair, keeping the
_total counters monotonic and additive (agreements_total + missed_total ==
ledgers reconciled); the repair-aware windowed view stays on the existing
xrpld_validation_agreement gauge. The validator-health dashboard panels
that already query these names now render data instead of "No data".

Also de-stale 09-data-collection-reference.md: §5b documented flat metric
names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits
labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the
stale flat-name tables with a pointer to the canonical labeled section,
reconcile the contradictory headline counts, and correct xrpld_job_count
to its real exported name xrpld_jobq_job_count.

Adds two GTests asserting gross tallies stay frozen on repair while net
totals move, plus the additive invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pratik Mankawde
2026-06-05 18:29:29 +01:00
parent 0d1d1aa0e1
commit dc5bb4b35c
6 changed files with 288 additions and 119 deletions


@@ -337,7 +337,7 @@ prefix=xrpld
| `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `xrpld_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) |
| `xrpld_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |
| `xrpld_jobq_job_count` | JobQueue.cpp | Current job queue depth (group `jobq`) | 0–100 (healthy) |
**Grafana dashboard**: _Node Health (System Metrics)_ (`xrpld-system-node-health`)
@@ -592,90 +592,22 @@ count_over_time({job="xrpld"} |= "trace_id=" [5m])
---
## 5b. Future: Internal Metric Gap Fill (Phase 9)
## 5b. Internal Metric Gap Fill (Phase 9)
> **Status**: Planned, not yet implemented.
> **Status**: Implemented.
> **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context
> **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details
Phase 9 fills ~50+ metrics that exist inside xrpld but currently lack time-series export. Uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, OTel `ObservableGauge` async callbacks for new categories.
Phase 9 fills the metrics that exist inside xrpld but previously lacked time-series export. It
uses a hybrid approach: `beast::insight` extensions for NodeStore I/O plus OTel `ObservableGauge`
async callbacks for new categories.
### New Metric Categories
#### NodeStore I/O (via beast::insight)
| Prometheus Metric | Type | Description |
| ---------------------------------- | ----- | ----------------------------------- |
| `xrpld_nodestore_reads_total` | Gauge | Cumulative read operations |
| `xrpld_nodestore_reads_hit` | Gauge | Cache-served reads |
| `xrpld_nodestore_writes` | Gauge | Cumulative write operations |
| `xrpld_nodestore_written_bytes` | Gauge | Cumulative bytes written |
| `xrpld_nodestore_read_bytes` | Gauge | Cumulative bytes read |
| `xrpld_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
| `xrpld_nodestore_write_load` | Gauge | Current write load score |
| `xrpld_nodestore_read_queue` | Gauge | Items in read queue |
#### Cache Hit Rates (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ----------------------------- | ----- | ------------------------------------ |
| `xrpld_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) |
| `xrpld_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate |
| `xrpld_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate |
| `xrpld_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) |
| `xrpld_cache_fullbelow_size` | Gauge | FullBelow cache size |
#### Transaction Queue (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | -------------------------------- |
| `xrpld_txq_count` | Gauge | Current transactions in queue |
| `xrpld_txq_max_size` | Gauge | Maximum queue capacity |
| `xrpld_txq_in_ledger` | Gauge | Transactions in open ledger |
| `xrpld_txq_per_ledger` | Gauge | Expected transactions per ledger |
| `xrpld_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level |
| `xrpld_txq_med_fee_level` | Gauge | Median fee level in queue |
| `xrpld_txq_reference_fee_level` | Gauge | Reference fee level |
| `xrpld_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed |
#### PerfLog Per-RPC Method (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------- | --------- | ----------------- | --------------------------- |
| `xrpld_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `xrpld_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed |
| `xrpld_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls errored |
| `xrpld_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
#### PerfLog Per-Job Type (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------- | --------- | ------------------- | --------------- |
| `xrpld_job_queued_total` | Counter | `job_type="<name>"` | Jobs queued |
| `xrpld_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `xrpld_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `xrpld_job_queued_duration_us_bucket` | Histogram | `job_type="<name>"` | Queue wait time |
| `xrpld_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time |
#### Counted Object Instances (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| -------------------- | ----- | --------------- | ------------------------------- |
| `xrpld_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry`
#### Fee Escalation & Load Factors (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ---------------------------------- | ----- | ------------------------------------ |
| `xrpld_load_factor` | Gauge | Combined transaction cost multiplier |
| `xrpld_load_factor_server` | Gauge | Server + cluster + network load |
| `xrpld_load_factor_local` | Gauge | Local server load only |
| `xrpld_load_factor_net` | Gauge | Network-wide load estimate |
| `xrpld_load_factor_cluster` | Gauge | Cluster peer load |
| `xrpld_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
| `xrpld_load_factor_fee_queue` | Gauge | Queue entry fee level |
> **Authoritative metric names live in [§ Phase 9: OTel SDK-Exported Metrics](#phase-9-otel-sdk-exported-metrics-metricsregistry) below.**
> Most internal metrics are emitted as **labeled** gauges — one instrument carrying many logical
> values via a `metric` label (e.g. `xrpld_cache_metrics{metric="SLE_hit_rate"}`,
> `xrpld_txq_metrics{metric="txq_count"}`, `xrpld_load_factor_metrics{metric="load_factor"}`,
> `xrpld_nodestore_state{metric="node_reads_total"}`) — not the flat per-name form. Query the
> labeled names; the flat names (`xrpld_cache_SLE_hit_rate`, `xrpld_txq_count`, …) are **not** emitted.
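
To make the labeled form concrete, the following is a minimal sketch of one instrument carrying several logical values under a `metric` label. `LabeledGauge` and `selectors()` are hypothetical names used only for illustration; they are not part of MetricsRegistry or the OTel SDK. The `AL_hit_rate` label value used in the usage note is assumed by analogy with the old flat name, and the sample values are made up.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one labeled gauge: a single family name that
// carries many logical values, keyed by the value of the `metric` label.
struct LabeledGauge
{
    std::string family;
    std::map<std::string, double> values;  // metric label -> latest sample

    // Build the PromQL selector for every series this instrument exposes.
    std::vector<std::string>
    selectors() const
    {
        std::vector<std::string> out;
        for (auto const& kv : values)
            out.push_back(family + "{metric=\"" + kv.first + "\"}");
        return out;
    }
};
```

With `family = "xrpld_cache_metrics"` and two label values, `selectors()` yields two series under one family, which is the shape dashboards should query instead of one flat name per value.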
#### Server Info (via OTel MetricsRegistry)
@@ -746,15 +678,23 @@ Phase 10 builds a 5-node validator docker-compose harness with RPC load generato
### Validated Telemetry Inventory
| Category | Expected Count | Validation Method |
| ------------------ | -------------- | -------------------------------- |
| Trace spans | 16 | Jaeger/Tempo API query |
| Span attributes | 22 | Per-span attribute assertion |
| StatsD metrics | 255+ | Prometheus query |
| Phase 9 metrics | 68+ | Prometheus query |
| SpanMetrics RED | 4 per span | Prometheus query |
| Grafana dashboards | 10 | Dashboard API "no data" check |
| Log-trace links | Present | Loki query + Tempo reverse check |
> **Counting note — families vs series.** A _metric family_ is one distinct Prometheus `__name__`
> (histogram `_bucket`/`_count`/`_sum` collapsed to one). A _series_ is a family × its label
> combinations. The legacy overlay-traffic block is the bulk of the count: ~56 message categories ×
> 4 (`_Bytes_In/_Out`, `_Messages_In/_Out`) ≈ 224 families on its own. The labeled gauges
> (`xrpld_cache_metrics{metric}`, …) are few families but many series. Validate against the figures
> below as **families currently emitting** (idle nodes under-report — workload-gated metrics such as
> per-RPC/error counters appear only once exercised, which is Phase 10's purpose).
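
The family definition above can be sketched as a small counting routine. This is an illustrative helper, not something the validation harness is known to contain: `countFamilies` and `endsWith` are hypothetical names, and the rule "only collapse a suffix when a matching `_bucket` series exists" is an assumption made here so that plain gauges ending in `_count` (such as `xrpld_jobq_job_count`) are not mistaken for histogram parts. For the traffic block, the arithmetic in the note is simply 56 × 4 = 224.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

inline bool
endsWith(std::string const& s, std::string const& suffix)
{
    return s.size() >= suffix.size() &&
        s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Count distinct metric families from a list of Prometheus __name__ values.
// Histogram parts (_bucket/_count/_sum) collapse to their base name, but only
// when a `<base>_bucket` series is present (assumption, see lead-in).
inline std::size_t
countFamilies(std::vector<std::string> const& names)
{
    static char const* const kSuffixes[] = {"_bucket", "_count", "_sum"};

    // Pass 1: collect the base names of histograms.
    std::set<std::string> histogramBases;
    for (auto const& n : names)
        if (endsWith(n, "_bucket"))
            histogramBases.insert(n.substr(0, n.size() - 7));

    // Pass 2: map each name to its family and de-duplicate.
    std::set<std::string> families;
    for (auto const& n : names)
    {
        std::string family = n;
        for (char const* raw : kSuffixes)
        {
            std::string const suffix(raw);
            if (!endsWith(n, suffix))
                continue;
            std::string const base = n.substr(0, n.size() - suffix.size());
            if (histogramBases.count(base) != 0)
            {
                family = base;
                break;
            }
        }
        families.insert(family);
    }
    return families.size();
}
```

Fed the three `xrpld_rpc_method_duration_us_*` names plus `xrpld_jobq_job_count` and `xrpld_cache_metrics`, this counts three families.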
| Category | Expected Count | Validation Method |
| ------------------------- | ------------------- | -------------------------------- |
| Trace spans | 16 | Jaeger/Tempo API query |
| Span attributes | 22 | Per-span attribute assertion |
| Legacy `xrpld_*` families | ~270 (≈224 traffic) | Prometheus `__name__` query |
| Native MetricsRegistry | 35 instruments | Prometheus query |
| SpanMetrics RED | 4 per span | Prometheus query |
| Grafana dashboards | 10 | Dashboard API "no data" check |
| Log-trace links | Present | Loki query + Tempo reverse check |
---
@@ -998,15 +938,27 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full
#### Synchronous Counters (Phase 7+)
| Prometheus Metric | Type | Description | Increment Site |
| ----------------------------------- | ------- | -------------------------------- | --------------------- |
| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `xrpld_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp |
| `xrpld_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp |
| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
| Prometheus Metric | Type | Description | Increment Site |
| --------------------------------- | ------- | ------------------------------- | ---------------- |
| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
Lifetime validation agreement/miss tallies are exported as monotonic **ObservableCounters**
(not synchronous counters) observed from `ValidationTracker`'s gross lifetime totals:
| Prometheus Metric | Type | Description | Source |
| ----------------------------------- | ----------------- | ------------------------------------------ | --------------------- |
| `xrpld_validation_agreements_total` | ObservableCounter | Lifetime validations that initially agreed | ValidationTracker.cpp |
| `xrpld_validation_missed_total` | ObservableCounter | Lifetime validations that initially missed | ValidationTracker.cpp |
> **Counting semantics (initial-classification only):** each reconciled ledger increments exactly
> one of these two counters, at first classification. A subsequent late repair (miss → agreement) does
> **not** move either counter — keeping both strictly monotonic (a Prometheus `_total` must never
> decrease) and additive (`agreements_total + missed_total` = ledgers reconciled). The
> repair-aware, windowed view remains on `xrpld_validation_agreement{metric="…"}`.
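
Why a decrease matters: Prometheus treats any drop in a counter as a process restart (counter reset) and assumes the counter climbed from zero to the new sample. The sketch below is a simplified model of that reset handling, not Prometheus's actual `increase()` implementation (which also extrapolates to the window boundaries). `rawIncrease` is a hypothetical name and the sample values are made up.

```cpp
#include <cstddef>
#include <vector>

// Simplified counter-increase over a window of samples.
// A sample lower than its predecessor is treated as a reset to zero,
// so the whole new value is counted as fresh increase.
inline double
rawIncrease(std::vector<double> const& samples)
{
    double total = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i)
    {
        double const prev = samples[i - 1];
        double const cur = samples[i];
        total += (cur < prev) ? cur : (cur - prev);
    }
    return total;
}
```

If the net miss count were exported as a `_total`, a series such as 5, 6, 5, 6 (one repair in the middle) would be read as an increase of 7 rather than the 2 misses that were actually first classified in that window. The gross tally for the same period, 5, 6, 6, 7, yields 2.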
#### Span Attribute Enrichments (Phases 2-4)
@@ -1071,7 +1023,7 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full
| Issue | Impact | Status |
| ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- |
| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1: needs `\|m` → `\|c` change in StatsDCollector.cpp |
| `xrpld_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `xrpld_jobq_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `xrpld_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg |
| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet |


@@ -132,6 +132,8 @@ TEST_F(ValidationTrackerTest, EmptyWindowReturnsZero)
EXPECT_EQ(tracker_.missed24h(), 0u);
EXPECT_EQ(tracker_.totalAgreements(), 0u);
EXPECT_EQ(tracker_.totalMissed(), 0u);
EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
EXPECT_EQ(tracker_.totalMissedEver(), 0u);
EXPECT_EQ(tracker_.totalValidationsSent(), 0u);
EXPECT_EQ(tracker_.totalValidationsChecked(), 0u);
}
@@ -282,3 +284,91 @@ TEST_F(ValidationTrackerTest, OnlyWeValidated)
EXPECT_EQ(tracker_.missed1h(), 1u);
EXPECT_DOUBLE_EQ(tracker_.agreementPct1h(), 0.0);
}
// ---------------------------------------------------------------
// 10. Gross miss tally is monotonic across a late repair
// The gross lifetime tallies (totalAgreementsEver/totalMissedEver)
// back the monotonic Prometheus _total counters. A late repair must
// move the NET totals (miss -> agreement) but must NOT move the gross
// tallies: a miss already counted stays counted, and the repair does
// not add a second (agreement) count for the same ledger.
// ---------------------------------------------------------------
TEST_F(ValidationTrackerTest, GrossMissedNeverDecrementsOnRepair)
{
auto const hash = makeHash(10);
LedgerIndex const seq = 1000;
// Network validates, we do not (yet).
tracker_.recordNetworkValidation(hash, seq);
// Grace period elapses -- reconciled as a miss.
std::this_thread::sleep_for(std::chrono::seconds(9));
tracker_.reconcile();
// Net and gross both show exactly one initial miss, zero agreements.
EXPECT_EQ(tracker_.totalMissed(), 1u);
EXPECT_EQ(tracker_.totalMissedEver(), 1u);
EXPECT_EQ(tracker_.totalAgreements(), 0u);
EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
// Late arrival of our validation repairs the miss to an agreement.
tracker_.recordOurValidation(hash, seq);
tracker_.reconcile();
// Net totals reflect the repair...
EXPECT_EQ(tracker_.totalMissed(), 0u);
EXPECT_EQ(tracker_.totalAgreements(), 1u);
// ...but the gross tallies are frozen at first classification: the miss
// stays counted and no agreement was added (repair path excluded).
EXPECT_EQ(tracker_.totalMissedEver(), 1u);
EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
}
// ---------------------------------------------------------------
// 11. Gross tallies count initial classification only (additive)
// With a mix of initial agreements and misses the gross tallies equal
// the net totals. A subsequent repair shifts the net totals but leaves
// the gross tallies unchanged, and the gross sum equals the number of
// reconciled ledgers (the additive invariant the _total counters rely on).
// ---------------------------------------------------------------
TEST_F(ValidationTrackerTest, GrossAgreementsCountInitialOnly)
{
// 3 initial agreements: both sides validate.
for (int i = 1; i <= 3; ++i)
{
auto const h = makeHash(static_cast<std::uint64_t>(i));
tracker_.recordOurValidation(h, static_cast<LedgerIndex>(i));
tracker_.recordNetworkValidation(h, static_cast<LedgerIndex>(i));
}
// 2 initial misses: only network validates.
for (int i = 4; i <= 5; ++i)
{
auto const h = makeHash(static_cast<std::uint64_t>(i));
tracker_.recordNetworkValidation(h, static_cast<LedgerIndex>(i));
}
// Grace period elapses -- all five reconciled at first classification.
std::this_thread::sleep_for(std::chrono::seconds(9));
tracker_.reconcile();
// Before any repair, gross equals net.
EXPECT_EQ(tracker_.totalAgreements(), 3u);
EXPECT_EQ(tracker_.totalAgreementsEver(), 3u);
EXPECT_EQ(tracker_.totalMissed(), 2u);
EXPECT_EQ(tracker_.totalMissedEver(), 2u);
// Repair one of the misses (hash 4) within the repair window.
tracker_.recordOurValidation(makeHash(4), 4);
tracker_.reconcile();
// Net totals shift by the repair...
EXPECT_EQ(tracker_.totalAgreements(), 4u);
EXPECT_EQ(tracker_.totalMissed(), 1u);
// ...gross tallies stay at the initial classification.
EXPECT_EQ(tracker_.totalAgreementsEver(), 3u);
EXPECT_EQ(tracker_.totalMissedEver(), 2u);
// Additive invariant: gross agree + gross miss == ledgers reconciled.
EXPECT_EQ(tracker_.totalAgreementsEver() + tracker_.totalMissedEver(), 5u);
}


@@ -244,10 +244,9 @@ MetricsRegistry::start(std::string const& endpoint, std::string const& instanceI
"xrpld_txq_expired_total", "Total transactions expired out of the transaction queue");
txqDroppedCounter_ = meter_->CreateUInt64Counter(
"xrpld_txq_dropped_total", "Total transactions refused admission to the queue by reason");
validationAgreementsCounter_ = meter_->CreateUInt64Counter(
"xrpld_validation_agreements_total", "Total validation agreements");
validationMissedCounter_ =
meter_->CreateUInt64Counter("xrpld_validation_missed_total", "Total validation misses");
// Note: xrpld_validation_agreements_total / xrpld_validation_missed_total
// are monotonic ObservableCounters created in registerValidationTotalsCounters()
// (below), observed from ValidationTracker's gross lifetime tallies.
// Register all observable (async) gauges.
registerAsyncGauges();
@@ -441,6 +440,7 @@ MetricsRegistry::registerAsyncGauges()
registerStateTrackingGauge();
registerStorageDetailGauge();
registerValidationAgreementGauge();
registerValidationTotalsCounters();
}
void
@@ -1325,13 +1325,67 @@ MetricsRegistry::registerValidationAgreementGauge()
}
},
this);
}
// Note: validationAgreementsCounter_ and validationMissedCounter_ are
// created above but not currently incremented. The
// xrpld_validation_agreement gauge already provides agreement and miss
// counts from ValidationTracker's rolling windows and lifetime totals.
// These counters are reserved for future use if a push-style counter
// integration with ValidationTracker is desired.
void
MetricsRegistry::registerValidationTotalsCounters()
{
// Lifetime validation agreement/miss counters.
//
// These are monotonic ObservableCounters (not the sync Counters they used
// to be): a Prometheus _total must never decrease, but ValidationTracker's
// NET totals are non-monotonic (a late repair decrements the net miss
// count). We therefore observe the tracker's GROSS lifetime tallies, which
// count each ledger once at first classification and are never adjusted on
// repair (initial-classification semantics — see ValidationTracker). The
// repaired/agreement view remains available from xrpld_validation_agreement.
//
// reconcile() is called first so pending events are resolved before the
// tallies are read; the callback fires every ~10 s from the
// PeriodicExportingMetricReader thread.
validationAgreementsObservable_ = meter_->CreateInt64ObservableCounter(
"xrpld_validation_agreements_total",
"Lifetime validations that initially agreed with network consensus");
validationAgreementsObservable_->AddCallback(
[](opentelemetry::metrics::ObserverResult result, void* state) {
auto* self = static_cast<MetricsRegistry*>(state);
if (self->callbacksDetached_.load(std::memory_order_acquire))
return;
try
{
self->validationTracker_.reconcile();
opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<
opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
->Observe(static_cast<int64_t>(self->validationTracker_.totalAgreementsEver()));
}
catch (...) // NOLINT(bugprone-empty-catch)
{
// Silently skip on error.
}
},
this);
validationMissedObservable_ = meter_->CreateInt64ObservableCounter(
"xrpld_validation_missed_total",
"Lifetime validations that initially missed network consensus");
validationMissedObservable_->AddCallback(
[](opentelemetry::metrics::ObserverResult result, void* state) {
auto* self = static_cast<MetricsRegistry*>(state);
if (self->callbacksDetached_.load(std::memory_order_acquire))
return;
try
{
self->validationTracker_.reconcile();
opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<
opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
->Observe(static_cast<int64_t>(self->validationTracker_.totalMissedEver()));
}
catch (...) // NOLINT(bugprone-empty-catch)
{
// Silently skip on error.
}
},
this);
}
#endif // XRPL_ENABLE_TELEMETRY
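
Review note: the two callbacks above share one shape (opaque `state` pointer, detach check with acquire ordering, observe inside a catch-all). The sketch below isolates that shape without the OTel SDK so it can be exercised on its own. `FakeObserver`, `FakeRegistry`, and `observeTally` are hypothetical stand-ins, not OTel or MetricsRegistry types, and `readTally` stands in for a call such as `totalMissedEver()`.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <optional>

// Stand-in for the SDK's observer result: remembers the last observed value.
struct FakeObserver
{
    std::optional<std::int64_t> last;

    void
    observe(std::int64_t v)
    {
        last = v;
    }
};

// Stand-in for the registry state reached through the opaque pointer.
struct FakeRegistry
{
    std::atomic<bool> callbacksDetached{false};
    std::function<std::uint64_t()> readTally;  // e.g. wraps a gross tally getter
};

// Same control flow as the lambdas in the diff: bail out if detached,
// otherwise observe the tally, and skip the cycle on any exception.
inline void
observeTally(FakeObserver& result, void* state)
{
    auto* self = static_cast<FakeRegistry*>(state);
    if (self->callbacksDetached.load(std::memory_order_acquire))
        return;
    try
    {
        result.observe(static_cast<std::int64_t>(self->readTally()));
    }
    catch (...)
    {
        // Nothing is observed for this collection cycle.
    }
}
```

Factoring the real callbacks through one helper like this would be a possible follow-up; the sketch only shows that a detached registry or a throwing getter leaves the observer untouched.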


@@ -529,13 +529,16 @@ private:
/// Counter: xrpld_txq_dropped_total{reason} — incremented when a transaction is refused
/// admission to the queue.
opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>> txqDroppedCounter_;
/// Counter: xrpld_validation_agreements_total — incremented by ValidationTracker on
/// agreement.
opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
validationAgreementsCounter_;
/// Counter: xrpld_validation_missed_total — incremented by ValidationTracker on miss.
opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
validationMissedCounter_;
/// ObservableCounter: xrpld_validation_agreements_total — observed from
/// ValidationTracker::totalAgreementsEver() (monotonic gross lifetime
/// tally, initial-classification semantics).
opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
validationAgreementsObservable_;
/// ObservableCounter: xrpld_validation_missed_total — observed from
/// ValidationTracker::totalMissedEver() (monotonic gross lifetime tally,
/// initial-classification semantics).
opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
validationMissedObservable_;
/** Register all observable gauge callbacks with the OTel SDK.
Dispatches to one helper per metric domain so that each helper
@@ -580,6 +583,8 @@ private:
registerStorageDetailGauge(); // Task 7.13
void
registerValidationAgreementGauge(); // Task 7.15
void
registerValidationTotalsCounters(); // gap-fill: lifetime agree/miss _total
#endif // XRPL_ENABLE_TELEMETRY
};


@@ -186,6 +186,26 @@ public:
uint64_t
totalMissed() const;
/** Lifetime agreements counted at first classification only.
* @note Unlike totalAgreements(), this is strictly monotonic: it is
* incremented only when a ledger is first reconciled as an agreement and
* is never adjusted by a late repair. It backs the monotonic Prometheus
* counter xrpld_validation_agreements_total. See the counting-semantics
* note in detail/ValidationTracker.cpp.
*/
uint64_t
totalAgreementsEver() const;
/** Lifetime misses counted at first classification only.
* @note Unlike totalMissed(), this is strictly monotonic: it is
* incremented only when a ledger is first reconciled as a miss and is
* never decremented by a late repair. It backs the monotonic Prometheus
* counter xrpld_validation_missed_total. See the counting-semantics note
* in detail/ValidationTracker.cpp.
*/
uint64_t
totalMissedEver() const;
/** Total validations this node sent. */
uint64_t
totalValidationsSent() const;
@@ -254,12 +274,33 @@ private:
/// Sliding window of reconciled events (last 7 days).
std::deque<WindowEvent> window7d_;
/// Lifetime count of agreements.
/// Lifetime count of agreements (net: incremented on agree, also on
/// repair). May be read via totalAgreements(); feeds the windowed gauge.
std::atomic<uint64_t> totalAgreements_{0};
/// Lifetime count of misses.
/// Lifetime count of misses (net: incremented on miss, decremented on
/// repair). NON-monotonic. May be read via totalMissed().
std::atomic<uint64_t> totalMissed_{0};
// Monotonic "gross" lifetime tallies for the Prometheus _total counters.
//
// Counting decision (initial-classification only): each reconciled ledger
// is counted exactly once, at its first classification, into exactly one
// of the two tallies below. A subsequent late repair (miss -> agreement) does
// NOT move either tally. This keeps both strictly monotonic (a Prometheus
// _total must never decrease) and additive:
// totalAgreementsGross_ + totalMissedGross_ == ledgers reconciled.
// The repaired/agreement view is still available from the windowed gauge
// (xrpld_validation_agreement) and the net totals above.
/// Monotonic lifetime initial agreements; backs
/// xrpld_validation_agreements_total. Never adjusted on repair.
std::atomic<uint64_t> totalAgreementsGross_{0};
/// Monotonic lifetime initial misses; backs xrpld_validation_missed_total.
/// Never decremented on repair.
std::atomic<uint64_t> totalMissedGross_{0};
/// Lifetime count of validations this node sent.
std::atomic<uint64_t> totalValidationsSent_{0};
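
Review note: the net/gross split documented above can be reduced to a standalone sketch. `TallySketch` is a hypothetical class written for illustration; it mirrors the four atomics and the repair rule described in the comments but omits the event map, grace period, and sliding windows of the real `ValidationTracker`.

```cpp
#include <atomic>
#include <cstdint>

class TallySketch
{
public:
    // First classification of a reconciled ledger: net and gross move together.
    void
    classify(bool agreed)
    {
        if (agreed)
        {
            netAgreements_.fetch_add(1, std::memory_order_relaxed);
            grossAgreements_.fetch_add(1, std::memory_order_relaxed);
        }
        else
        {
            netMissed_.fetch_add(1, std::memory_order_relaxed);
            grossMissed_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Late repair (miss -> agreement): only the net pair moves.
    // The gross pair is deliberately left untouched.
    void
    repair()
    {
        netMissed_.fetch_sub(1, std::memory_order_relaxed);
        netAgreements_.fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t netAgreements() const { return netAgreements_.load(std::memory_order_relaxed); }
    std::uint64_t netMissed() const { return netMissed_.load(std::memory_order_relaxed); }
    std::uint64_t grossAgreements() const { return grossAgreements_.load(std::memory_order_relaxed); }
    std::uint64_t grossMissed() const { return grossMissed_.load(std::memory_order_relaxed); }

private:
    std::atomic<std::uint64_t> netAgreements_{0};
    std::atomic<std::uint64_t> netMissed_{0};
    std::atomic<std::uint64_t> grossAgreements_{0};
    std::atomic<std::uint64_t> grossMissed_{0};
};
```

After three initial agreements, two initial misses, and one repair, the net totals read 4 and 1 while the gross tallies stay at 3 and 2, and the gross sum still equals the five ledgers reconciled.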

View File

@@ -63,10 +63,16 @@ ValidationTracker::reconcile()
if (evt.agreed)
{
totalAgreements_.fetch_add(1, std::memory_order_relaxed);
// Gross tally: count the initial agreement once. See the
// counting-decision note below (repair branch).
totalAgreementsGross_.fetch_add(1, std::memory_order_relaxed);
}
else
{
totalMissed_.fetch_add(1, std::memory_order_relaxed);
// Gross tally: count the initial miss once. See the
// counting-decision note below (repair branch).
totalMissedGross_.fetch_add(1, std::memory_order_relaxed);
}
WindowEvent const we{.time = now, .ledgerHash = evt.ledgerHash, .agreed = evt.agreed};
@@ -78,11 +84,20 @@ ValidationTracker::reconcile()
evt.reconciled && !evt.agreed && evt.weValidated && evt.networkValidated &&
(now - evt.recordTime) <= kLateRepairWindow)
{
// Late repair: was a miss, now both flags set.
// Late repair: was a miss, now both flags set. Adjust the NET
// totals (used by the windowed agreement gauge) so the live view
// reflects the repair.
evt.agreed = true;
totalMissed_.fetch_sub(1, std::memory_order_relaxed);
totalAgreements_.fetch_add(1, std::memory_order_relaxed);
// Counting decision (initial-classification only): the gross
// tallies (totalAgreementsGross_ / totalMissedGross_) that back the
// monotonic Prometheus _total counters are deliberately NOT touched
// here. Each ledger is counted once, at first classification; a
// repair must not decrement missed (a _total may never decrease)
// nor add a second agreement (which would double-count the ledger).
// Flip the corresponding window entries from miss to agreement.
repairWindowEntry(window1h_, evt.ledgerHash);
repairWindowEntry(window24h_, evt.ledgerHash);
@@ -253,6 +268,18 @@ ValidationTracker::totalMissed() const
return totalMissed_.load(std::memory_order_relaxed);
}
uint64_t
ValidationTracker::totalAgreementsEver() const
{
return totalAgreementsGross_.load(std::memory_order_relaxed);
}
uint64_t
ValidationTracker::totalMissedEver() const
{
return totalMissedGross_.load(std::memory_order_relaxed);
}
uint64_t
ValidationTracker::totalValidationsSent() const
{