Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)

Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-03-10 16:20:54 +00:00
parent b73592f934
commit 9289cb671d
13 changed files with 2722 additions and 11 deletions

View File

@@ -21,7 +21,8 @@ class PerfLog;
}
namespace telemetry {
class Telemetry;
}
class MetricsRegistry;
} // namespace telemetry
// This is temporary until we migrate all code to use ServiceRegistry.
class Application;
@@ -211,6 +212,12 @@ public:
virtual telemetry::Telemetry&
getTelemetry() = 0;
/** Return the MetricsRegistry, or nullptr if telemetry is disabled.
Used by PerfLog and other hot paths to record OTel metrics.
*/
virtual telemetry::MetricsRegistry*
getMetricsRegistry() = 0;
// Configuration and state
virtual bool
isStopping() const = 0;