Mirror of https://github.com/XRPLF/rippled.git, synced 2026-04-29 15:37:57 +00:00
Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates, TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances, and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram) for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ, CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -231,7 +231,7 @@ When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yam

## Grafana Dashboards

-Eight dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
+Thirteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:

### RPC Performance (`rippled-rpc-perf`)
@@ -403,8 +403,74 @@ count_over_time({job="rippled"} |= "trace_id=" [5m])
4. Open Grafana at http://localhost:3000 -> Explore -> Loki and search for `{job="rippled"} |= "trace_id="`.
5. Click the TraceID link to navigate to the corresponding trace in Tempo.

## Phase 9: OTel Metrics Alerting Rules

The following alerting rules are recommended for the Phase 9 OTel SDK metrics.
Add them to your Prometheus alerting rules configuration.

### NodeStore

| Alert Name                  | Severity | Condition                                            | For | Description                                             |
| --------------------------- | -------- | ---------------------------------------------------- | --- | ------------------------------------------------------- |
| `NodeStoreHighWriteLoad`    | Warning  | `rippled_nodestore_state{metric="write_load"} > 100` | 5m  | NodeStore backend is under sustained write pressure     |
| `NodeStoreReadQueueBacklog` | Warning  | `rippled_nodestore_state{metric="read_queue"} > 500` | 5m  | Prefetch thread pool is saturated; reads are backing up |
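
As a rough illustration of how table rows like these map onto a Prometheus rule file, the sketch below renders the two NodeStore alerts as a rule group. The `alert`, `expr`, `for`, `labels`, and `annotations` keys are the standard Prometheus rule fields; the group name `rippled_nodestore`, the lowercase `severity` label values, and the helper names are assumptions made for this example, not part of the runbook.

```python
def yaml_quote(text: str) -> str:
    """Wrap text in YAML single quotes, doubling any embedded single quote."""
    return "'" + text.replace("'", "''") + "'"


def render_rule_group(group_name: str, rules: list) -> str:
    """Render alert rows as the text of a Prometheus rule-file group."""
    lines = ["groups:", f"  - name: {group_name}", "    rules:"]
    for rule in rules:
        lines += [
            f"      - alert: {rule['alert']}",
            f"        expr: {yaml_quote(rule['expr'])}",
            f"        for: {rule['for']}",
            "        labels:",
            f"          severity: {rule['severity']}",
            "        annotations:",
            f"          description: {yaml_quote(rule['description'])}",
        ]
    return "\n".join(lines) + "\n"


# Rows copied from the NodeStore table; severity is lowercased by assumption.
NODESTORE_RULES = [
    {
        "alert": "NodeStoreHighWriteLoad",
        "expr": 'rippled_nodestore_state{metric="write_load"} > 100',
        "for": "5m",
        "severity": "warning",
        "description": "NodeStore backend is under sustained write pressure",
    },
    {
        "alert": "NodeStoreReadQueueBacklog",
        "expr": 'rippled_nodestore_state{metric="read_queue"} > 500',
        "for": "5m",
        "severity": "warning",
        "description": "Prefetch thread pool is saturated; reads are backing up",
    },
]

# "rippled_nodestore" is a hypothetical group name.
print(render_rule_group("rippled_nodestore", NODESTORE_RULES))
```

The same helper could be fed the rows of the other tables in this section; expressions are single-quoted so that the embedded double quotes and braces need no escaping.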

### Cache

| Alert Name              | Severity | Condition                                               | For | Description                                            |
| ----------------------- | -------- | ------------------------------------------------------- | --- | ------------------------------------------------------ |
| `SLECacheHitRateLow`    | Warning  | `rippled_cache_metrics{metric="SLE_hit_rate"} < 0.5`    | 10m | SLE cache is thrashing; consider increasing cache size |
| `LedgerCacheHitRateLow` | Warning  | `rippled_cache_metrics{metric="ledger_hit_rate"} < 0.5` | 10m | Ledger cache hit rate is degraded                      |

### Transaction Queue

| Alert Name             | Severity | Condition                                                                                                                                | For | Description                                        |
| ---------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------- | --- | -------------------------------------------------- |
| `TxQNearCapacity`      | Warning  | `rippled_txq_metrics{metric="txq_count"} / ignoring(metric) rippled_txq_metrics{metric="txq_max_size"} > 0.8`                            | 5m  | TxQ is >80% full; transactions may be rejected     |
| `TxQHighFeeEscalation` | Warning  | `rippled_txq_metrics{metric="txq_open_ledger_fee_level"} / ignoring(metric) rippled_txq_metrics{metric="txq_reference_fee_level"} > 10`  | 5m  | Fee escalation is 10x above reference; high demand |
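
Both Transaction Queue conditions are ratios of two gauge values compared against a fixed threshold. The sketch below reproduces that arithmetic outside Prometheus; the function names and sample values are illustrative assumptions, and the `For` duration (the condition must hold for 5 minutes) is not modelled.

```python
from typing import Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is not positive."""
    if denominator <= 0:
        return None
    return numerator / denominator


def txq_near_capacity(txq_count: float, txq_max_size: float,
                      threshold: float = 0.8) -> bool:
    """Mirror of the TxQNearCapacity condition: queue fill ratio above 0.8."""
    fill = ratio(txq_count, txq_max_size)
    return fill is not None and fill > threshold


def txq_high_fee_escalation(open_ledger_fee_level: float,
                            reference_fee_level: float,
                            threshold: float = 10.0) -> bool:
    """Mirror of the TxQHighFeeEscalation condition: fee level above 10x reference."""
    multiple = ratio(open_ledger_fee_level, reference_fee_level)
    return multiple is not None and multiple > threshold


# Illustrative values only, not measurements from a real node.
print(txq_near_capacity(1700, 2000))        # 0.85 > 0.8, prints True
print(txq_high_fee_escalation(2560, 256))   # exactly 10x is not above 10, prints False
```

Note that both comparisons are strict, so a queue at exactly 80% or a fee level at exactly 10x the reference does not satisfy the condition.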

### Load Factor

| Alert Name            | Severity | Condition                                                      | For | Description                                                    |
| --------------------- | -------- | -------------------------------------------------------------- | --- | -------------------------------------------------------------- |
| `HighLoadFactor`      | Warning  | `rippled_load_factor_metrics{metric="load_factor"} > 5`        | 10m | Combined load factor is elevated; transactions cost 5x+ normal |
| `HighLocalLoadFactor` | Critical | `rippled_load_factor_metrics{metric="load_factor_local"} > 10` | 5m  | Local server load is critically elevated                       |

### RPC Performance

| Alert Name         | Severity | Condition                                                                                                  | For | Description                       |
| ------------------ | -------- | ---------------------------------------------------------------------------------------------------------- | --- | --------------------------------- |
| `HighRPCErrorRate` | Warning  | `sum(rate(rippled_rpc_method_errored_total[5m])) / sum(rate(rippled_rpc_method_started_total[5m])) > 0.05` | 5m  | >5% of RPC calls are erroring     |
| `SlowRPCLatency`   | Warning  | `histogram_quantile(0.95, sum by (le) (rate(rippled_rpc_method_duration_us_bucket[5m]))) > 5000000`        | 5m  | RPC p95 latency exceeds 5 seconds |
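
The `SlowRPCLatency` threshold is expressed in microseconds (5000000 us = 5 s) because the histogram is `rippled_rpc_method_duration_us`. The sketch below approximates how a quantile is estimated from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. It is a simplified model of what `histogram_quantile` does, not the Prometheus implementation, and the bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The highest bucket must have upper_bound == math.inf, as Prometheus
    histograms do. Values are assumed non-negative, so the lowest bucket
    starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with an +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical bucket layout in microseconds: (upper bound, cumulative count).
sample = [(1_000_000, 80), (5_000_000, 90), (10_000_000, 100), (math.inf, 100)]
p95_us = histogram_quantile(0.95, sample)
print(p95_us, p95_us > 5_000_000)
```

With these sample buckets the 95th-percentile rank lands halfway through the 5 s to 10 s bucket, so the estimate is about 7.5 s and the alert condition would hold. The estimate is only as fine-grained as the bucket boundaries.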

### Job Queue

| Alert Name         | Severity | Condition                                                                                             | For | Description                                          |
| ------------------ | -------- | ----------------------------------------------------------------------------------------------------- | --- | ---------------------------------------------------- |
| `JobQueueBacklog`  | Warning  | `sum(rate(rippled_job_queued_total[5m])) - sum(rate(rippled_job_finished_total[5m])) > 100`           | 5m  | Jobs are being queued faster than they're completing |
| `SlowJobExecution` | Warning  | `histogram_quantile(0.95, sum by (le) (rate(rippled_job_running_duration_us_bucket[5m]))) > 10000000` | 5m  | Job execution p95 exceeds 10 seconds                 |
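
The `JobQueueBacklog` condition compares two per-second rates, so its threshold of 100 means the queue is growing by more than 100 jobs per second, not that 100 jobs are waiting. A minimal sketch of that arithmetic follows; it uses a simplified rate (last sample minus first, divided by elapsed time) that ignores the counter-reset handling and extrapolation that PromQL `rate()` performs, and the sample values are hypothetical.

```python
def per_second_rate(samples: list) -> float:
    """Average per-second increase between the first and last (timestamp, value) sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time range")
    return (v_last - v_first) / (t_last - t_first)


def queue_growth_rate(queued_samples: list, finished_samples: list) -> float:
    """Jobs queued per second minus jobs finished per second."""
    return per_second_rate(queued_samples) - per_second_rate(finished_samples)


# Hypothetical counters sampled 300 seconds (5 minutes) apart.
queued = [(0, 0), (300, 60_000)]     # 200 jobs/s queued
finished = [(0, 0), (300, 24_000)]   # 80 jobs/s finished
growth = queue_growth_rate(queued, finished)
print(growth, growth > 100)  # 120 jobs/s net growth exceeds the threshold
```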

## Troubleshooting

### No OTel SDK metrics in Prometheus

1. Verify `enabled=1` in the `[telemetry]` config section
2. Check that `metrics_endpoint` points to the OTel Collector's HTTP receiver
   (default: `http://localhost:4318/v1/metrics`)
3. Check rippled logs for `MetricsRegistry: started successfully` message
4. Verify the OTel Collector is configured with an OTLP receiver and Prometheus exporter
5. Check Prometheus targets page for the collector scrape target
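
Once the checks above pass, one way to confirm that metrics are actually arriving is to ask Prometheus for a Phase 9 series through its HTTP query API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at `http://localhost:9090`, which is its default port but is not stated in this runbook; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default port on this host.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def series_count(payload: dict) -> int:
    """Return how many series an instant-query response contains."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])


def metric_is_present(metric_name: str, base_url: str = PROMETHEUS_URL,
                      timeout: float = 5.0) -> bool:
    """Return True if Prometheus currently has at least one series for the metric."""
    url = build_query_url(base_url, metric_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return series_count(json.load(response)) > 0
```

For example, `metric_is_present("rippled_rpc_method_started_total")` should return `True` after the node has served at least one RPC request and the collector has been scraped.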

### Cache hit rates are zero

Cache hit rates may be zero during startup before caches are warmed. Wait for the
node to reach `Full` operating mode and process several ledgers before investigating.

### NodeStore I/O counters not incrementing

NodeStore counters are cumulative and may appear flat if the node is idle. Submit
some transactions or RPC requests to generate I/O activity.

### No traces appearing in Jaeger

1. Check rippled logs for `Telemetry starting` message