Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)

Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation & tests:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pratik Mankawde
2026-03-10 16:20:54 +00:00
parent b73592f934
commit 9289cb671d
13 changed files with 2722 additions and 11 deletions


@@ -231,7 +231,7 @@ When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yam
## Grafana Dashboards
-Eight dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
+Thirteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
### RPC Performance (`rippled-rpc-perf`)
@@ -403,8 +403,74 @@ count_over_time({job="rippled"} |= "trace_id=" [5m])
4. Open Grafana at http://localhost:3000 -> Explore -> Loki and search for `{job="rippled"} |= "trace_id="`.
5. Click the TraceID link to navigate to the corresponding trace in Tempo.
## Phase 9: OTel Metrics Alerting Rules
The following alerting rules are recommended for the Phase 9 OTel SDK metrics.
Add them to your Prometheus alerting rules configuration.
### NodeStore
| Alert Name | Severity | Condition | For | Description |
| --------------------------- | -------- | ---------------------------------------------------- | --- | ------------------------------------------------------- |
| `NodeStoreHighWriteLoad` | Warning | `rippled_nodestore_state{metric="write_load"} > 100` | 5m | NodeStore backend is under sustained write pressure |
| `NodeStoreReadQueueBacklog` | Warning | `rippled_nodestore_state{metric="read_queue"} > 500` | 5m | Prefetch thread pool is saturated; reads are backing up |
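As a minimal sketch of how one row of these tables might translate into a Prometheus rule file, the `NodeStoreHighWriteLoad` alert could be written as shown here. The group name `rippled-phase9-nodestore` and the lowercase `severity` label value are assumptions for illustration; the expression, duration, and description are taken from the table row.

```yaml
groups:
  - name: rippled-phase9-nodestore # hypothetical group name
    rules:
      - alert: NodeStoreHighWriteLoad
        # Condition column: fires when the write_load gauge stays above 100
        expr: 'rippled_nodestore_state{metric="write_load"} > 100'
        # For column: the condition must hold continuously for this long
        for: 5m
        labels:
          severity: warning # assumed label convention for the Severity column
        annotations:
          description: NodeStore backend is under sustained write pressure
```

The other rows would follow the same shape, with the Condition column as `expr` and the For column as `for`.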
### Cache
| Alert Name | Severity | Condition | For | Description |
| ----------------------- | -------- | ------------------------------------------------------- | --- | ------------------------------------------------------ |
| `SLECacheHitRateLow` | Warning | `rippled_cache_metrics{metric="SLE_hit_rate"} < 0.5` | 10m | SLE cache is thrashing; consider increasing cache size |
| `LedgerCacheHitRateLow` | Warning | `rippled_cache_metrics{metric="ledger_hit_rate"} < 0.5` | 10m | Ledger cache hit rate is degraded |
### Transaction Queue
| Alert Name | Severity | Condition | For | Description |
| ---------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------- | --- | -------------------------------------------------- |
| `TxQNearCapacity` | Warning | `rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"} > 0.8` | 5m | TxQ is >80% full; transactions may be rejected |
| `TxQHighFeeEscalation` | Warning | `rippled_txq_metrics{metric="txq_open_ledger_fee_level"} / rippled_txq_metrics{metric="txq_reference_fee_level"} > 10` | 5m | Fee escalation is 10x above reference; high demand |
### Load Factor
| Alert Name | Severity | Condition | For | Description |
| --------------------- | -------- | -------------------------------------------------------------- | --- | -------------------------------------------------------------- |
| `HighLoadFactor` | Warning | `rippled_load_factor_metrics{metric="load_factor"} > 5` | 10m | Combined load factor is elevated; transactions cost 5x+ normal |
| `HighLocalLoadFactor` | Critical | `rippled_load_factor_metrics{metric="load_factor_local"} > 10` | 5m | Local server load is critically elevated |
### RPC Performance
| Alert Name | Severity | Condition | For | Description |
| ------------------ | -------- | ---------------------------------------------------------------------------------------------------------- | --- | --------------------------------- |
| `HighRPCErrorRate` | Warning | `sum(rate(rippled_rpc_method_errored_total[5m])) / sum(rate(rippled_rpc_method_started_total[5m])) > 0.05` | 5m | >5% of RPC calls are erroring |
| `SlowRPCLatency` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_rpc_method_duration_us_bucket[5m]))) > 5000000` | 5m | RPC p95 latency exceeds 5 seconds |
### Job Queue
| Alert Name | Severity | Condition | For | Description |
| ------------------ | -------- | ----------------------------------------------------------------------------------------------------- | --- | ---------------------------------------------------- |
| `JobQueueBacklog`  | Warning  | `sum(rate(rippled_job_queued_total[5m])) - sum(rate(rippled_job_finished_total[5m])) > 100`           | 5m  | Jobs queued outpace jobs finished by >100 per second |
| `SlowJobExecution` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_job_running_duration_us_bucket[5m]))) > 10000000` | 5m | Job execution p95 exceeds 10 seconds |
## Troubleshooting
### No OTel SDK metrics in Prometheus
1. Verify `enabled=1` in the `[telemetry]` config section
2. Check that `metrics_endpoint` points to the OTel Collector's HTTP receiver
   (default: `http://localhost:4318/v1/metrics`)
3. Check the rippled logs for the `MetricsRegistry: started successfully` message
4. Verify the OTel Collector is configured with an OTLP receiver and Prometheus exporter
5. Check Prometheus targets page for the collector scrape target
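Step 4 can be checked against a minimal Collector metrics pipeline such as the sketch below. This is not the repository's `otel-collector-config.yaml`: the listen addresses and the `8889` exporter port are assumptions, and the `prometheus` exporter is assumed to be available in the Collector distribution in use. Port `4318` matches the default `metrics_endpoint` from step 2.

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318 # OTLP/HTTP receiver; rippled posts to /v1/metrics here

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889 # assumed port that Prometheus scrapes

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

With a layout like this, the scrape target in step 5 would be the exporter's port on the Collector rather than rippled itself.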
### Cache hit rates are zero
Cache hit rates may be zero during startup before caches are warmed. Wait for the
node to reach `Full` operating mode and process several ledgers before investigating.
### NodeStore I/O counters not incrementing
NodeStore counters are cumulative and may appear flat if the node is idle. Submit
some transactions or RPC requests to generate I/O activity.
### No traces appearing in Jaeger
1. Check rippled logs for `Telemetry starting` message