diff --git a/.github/scripts/levelization/results/loops.txt b/.github/scripts/levelization/results/loops.txt
index 1110b0b298..e5d8dd4c1f 100644
--- a/.github/scripts/levelization/results/loops.txt
+++ b/.github/scripts/levelization/results/loops.txt
@@ -17,8 +17,11 @@ Loop: xrpld.app xrpld.shamap
   xrpld.shamap ~= xrpld.app
 
 Loop: xrpld.app xrpld.telemetry
-  xrpld.telemetry ~= xrpld.app
+  xrpld.telemetry == xrpld.app
 
 Loop: xrpld.overlay xrpld.rpc
   xrpld.rpc ~= xrpld.overlay
 
+Loop: xrpld.overlay xrpld.telemetry
+  xrpld.telemetry == xrpld.overlay
+
diff --git a/.github/scripts/levelization/results/ordering.txt b/.github/scripts/levelization/results/ordering.txt
index 2e7ff014fd..256fe4d1fc 100644
--- a/.github/scripts/levelization/results/ordering.txt
+++ b/.github/scripts/levelization/results/ordering.txt
@@ -257,7 +257,6 @@ xrpld.overlay > xrpl.basics
 xrpld.overlay > xrpl.core
 xrpld.overlay > xrpld.core
 xrpld.overlay > xrpld.peerfinder
-xrpld.overlay > xrpld.telemetry
 xrpld.overlay > xrpl.json
 xrpld.overlay > xrpl.protocol
 xrpld.overlay > xrpl.rdb
@@ -290,5 +289,7 @@ xrpld.shamap > xrpl.shamap
 xrpld.telemetry > xrpl.basics
 xrpld.telemetry > xrpl.core
 xrpld.telemetry > xrpl.nodestore
+xrpld.telemetry > xrpl.protocol
+xrpld.telemetry > xrpl.rdb
 xrpld.telemetry > xrpl.server
 xrpld.telemetry > xrpl.telemetry
diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md
index 75e62895c2..9001892bb5 100644
--- a/OpenTelemetryPlan/06-implementation-phases.md
+++ b/OpenTelemetryPlan/06-implementation-phases.md
@@ -671,7 +671,7 @@ flowchart LR
 
 ### Motivation
 
-Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~50+ metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most — NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
+Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~68 metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most — NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
 
 ### Architecture
 
@@ -747,6 +747,7 @@ flowchart TB
 | 9.5 | PerfLog per-job metrics |
 | 9.6 | Counted object instance metrics |
 | 9.7 | Fee escalation & load factor metrics |
+| 9.7a | push_metrics.py parity gauges |
 | 9.8 | New Grafana dashboards (2 new, 2 updated) |
 | 9.9 | Update documentation |
 | 9.10 | Integration tests |
@@ -755,7 +756,7 @@ See [Phase9_taskList.md](./Phase9_taskList.md) for detailed per-task breakdown.
 
 ### Exit Criteria
 
-- [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline
+- [ ] All ~68 new metrics visible in Prometheus via OTLP pipeline
 - [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK
 - [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue)
 - [ ] No performance regression (< 0.5% CPU overhead from new callbacks)
@@ -1130,7 +1131,6 @@ Clear, measurable criteria for each phase.
 
 ### 6.12.1 Phase 1: Core Infrastructure
 
-
 | Criterion | Measurement | Target |
 | --------------- | ---------------------------------------------------------- | ---------------------------- |
 | SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles |
@@ -1143,7 +1143,6 @@ Clear, measurable criteria for each phase.
 
 ### 6.12.2 Phase 2: RPC Tracing
 
-
 | Criterion | Measurement | Target |
 | ------------------ | ---------------------------------- | -------------------------- |
 | Coverage | All RPC commands instrumented | 100% of commands |
@@ -1154,10 +1153,8 @@ Clear, measurable criteria for each phase.
 
 **Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
 
-
 ### 6.12.3 Phase 3: Transaction Tracing
 
-
 | Criterion | Measurement | Target |
 | ---------------- | ------------------------------- | ---------------------------------- |
 | Local Trace | Submit → validate → TxQ traced | Single-node test passes |
@@ -1170,7 +1167,6 @@ Clear, measurable criteria for each phase.
 
 ### 6.12.4 Phase 4: Consensus Tracing
 
-
 | Criterion | Measurement | Target |
 | -------------------- | ----------------------------- | ------------------------- |
 | Round Tracing | startRound creates root span | Unit test passes |
@@ -1183,7 +1179,6 @@ Clear, measurable criteria for each phase.
 
 ### 6.12.5 Phase 5: Production Deployment
 
-
 | Criterion | Measurement | Target |
 | ------------ | ---------------------------- | -------------------------- |
 | Collector HA | Multiple collectors deployed | No single point of failure |
@@ -1207,7 +1202,7 @@ Clear, measurable criteria for each phase.
 | Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | Active |
 | Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | Active |
 | Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 | Active |
-| Phase 9 | 50+ new internal metrics in Prom | 2 new dashboards | End of Week 15 | Future Enhancement |
+| Phase 9 | 68+ new internal metrics in Prom | 2 new dashboards | End of Week 15 | Future Enhancement |
 | Phase 10 | Full telemetry stack validated | < 3% CPU overhead proven | End of Week 17 | Future Enhancement |
 | Phase 11 | Third-party metrics via receiver | 4 new dashboards + alerting | End of Week 20 | Future Enhancement |
 
diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md
index b6e12fd318..0ed3f7b070 100644
--- a/OpenTelemetryPlan/08-appendix.md
+++ b/OpenTelemetryPlan/08-appendix.md
@@ -259,7 +259,6 @@ This guide maps Phase 9–11 content to its location across the documentation.
 | New dashboards (4) | Validator Health, Network Topology, Fee Market (External), DEX & AMM |
 
 **Consumer categories**: Exchanges, Payment Processors, DeFi/AMM, NFT Marketplaces, Analytics Providers, Wallets, Compliance, Academic Researchers, Institutional Custody, CBDC Bridge Operators.
->>>>>>> 58b5170180 (Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards)
 
 ---
 
diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md
index e208c38e09..d01f98a350 100644
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -650,6 +650,56 @@ Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `
 | `rippled_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
 | `rippled_load_factor_fee_queue` | Gauge | Queue entry fee level |
 
+#### Server Info (via OTel MetricsRegistry)
+
+| Prometheus Metric | Type | Labels | Description |
+| ----------------------------------------------------------- | ----- | -------- | -------------------------------------------- |
+| `rippled_server_info{metric="server_state"}` | Gauge | `metric` | Operating mode (0=DISCONNECTED .. 4=FULL) |
+| `rippled_server_info{metric="uptime"}` | Gauge | `metric` | Seconds since server start |
+| `rippled_server_info{metric="peers"}` | Gauge | `metric` | Total connected peers |
+| `rippled_server_info{metric="validated_ledger_seq"}` | Gauge | `metric` | Validated ledger sequence number |
+| `rippled_server_info{metric="ledger_current_index"}` | Gauge | `metric` | Current open ledger sequence |
+| `rippled_server_info{metric="peer_disconnects_resources"}` | Gauge | `metric` | Cumulative resource-related peer disconnects |
+| `rippled_server_info{metric="last_close_proposers"}` | Gauge | `metric` | Proposers in last closed round |
+| `rippled_server_info{metric="last_close_converge_time_ms"}` | Gauge | `metric` | Last close convergence time (milliseconds) |
+
+#### Build Info (via OTel MetricsRegistry)
+
+| Prometheus Metric | Type | Labels | Description |
+| ------------------------------------- | ----- | --------- | --------------------------------- |
+| `rippled_build_info{version=""}` | Gauge | `version` | Info-style metric, always value 1 |
+
+#### Complete Ledger Ranges (via OTel MetricsRegistry)
+
+| Prometheus Metric | Type | Labels | Description |
+| ----------------------------------------------------- | ----- | --------------- | --------------------------- |
+| `rippled_complete_ledgers{bound="start",index=""}` | Gauge | `bound`,`index` | Start of contiguous range N |
+| `rippled_complete_ledgers{bound="end",index=""}` | Gauge | `bound`,`index` | End of contiguous range N |
+
+#### Database Metrics (via OTel MetricsRegistry)
+
+| Prometheus Metric | Type | Labels | Description |
+| --------------------------------------------------- | ----- | -------- | --------------------------------- |
+| `rippled_db_metrics{metric="db_kb_total"}` | Gauge | `metric` | Total database size (KB) |
+| `rippled_db_metrics{metric="db_kb_ledger"}` | Gauge | `metric` | Ledger database size (KB) |
+| `rippled_db_metrics{metric="db_kb_transaction"}` | Gauge | `metric` | Transaction database size (KB) |
+| `rippled_db_metrics{metric="historical_perminute"}` | Gauge | `metric` | Historical ledger fetches per min |
+
+#### Extended Cache Metrics (additions to existing rippled_cache_metrics)
+
+| Prometheus Metric | Type | Labels | Description |
+| ----------------------------------------- | ----- | -------- | ------------------------- |
+| `rippled_cache_metrics{metric="AL_size"}` | Gauge | `metric` | AcceptedLedger cache size |
+
+#### Extended NodeStore Metrics (additions to existing rippled_nodestore_state)
+
+| Prometheus Metric | Type | Labels | Description |
+| ---------------------------------------------------------- | ----- | -------- | ----------------------------------- |
+| `rippled_nodestore_state{metric="node_reads_duration_us"}` | Gauge | `metric` | Cumulative read time (microseconds) |
+| `rippled_nodestore_state{metric="read_request_bundle"}` | Gauge | `metric` | Read request bundle count |
+| `rippled_nodestore_state{metric="read_threads_running"}` | Gauge | `metric` | Active read threads |
+| `rippled_nodestore_state{metric="read_threads_total"}` | Gauge | `metric` | Total read threads configured |
+
 ### New Grafana Dashboards (Phase 9)
 
 | Dashboard | UID | Data Source | Key Panels |
@@ -674,7 +724,7 @@ Phase 10 builds a 5-node validator docker-compose harness with RPC load generato
 | Trace spans | 16 | Jaeger/Tempo API query |
 | Span attributes | 22 | Per-span attribute assertion |
 | StatsD metrics | 255+ | Prometheus query |
-| Phase 9 metrics | 50+ | Prometheus query |
+| Phase 9 metrics | 68+ | Prometheus query |
 | SpanMetrics RED | 4 per span | Prometheus query |
 | Grafana dashboards | 10 | Dashboard API "no data" check |
 | Log-trace links | Present | Loki query + Tempo reverse check |
diff --git a/OpenTelemetryPlan/Phase9_taskList.md b/OpenTelemetryPlan/Phase9_taskList.md
index 1b383592f9..69af4d9263 100644
--- a/OpenTelemetryPlan/Phase9_taskList.md
+++ b/OpenTelemetryPlan/Phase9_taskList.md
@@ -231,6 +231,48 @@ These metrics serve multiple external consumer categories identified during rese
 
 ---
 
+## Task 9.7a: push_metrics.py Parity — Missing Observable Gauges
+
+**Objective**: Fill the remaining metric gaps between the external `push_metrics.py` script (in `ripplex-ansible`) and the internal OTel `MetricsRegistry` observable gauges. After this task, all metrics collected by `push_metrics.py` that CAN be collected internally are covered.
+
+**What was done**:
+
+- Extended existing `cacheHitRateGauge_` callback with `AL_size` (AcceptedLedger cache size)
+- Extended existing `nodeStoreGauge_` callback with 4 new metrics from `getCountsJson()`:
+  - `node_reads_duration_us` (JSON string — uses `std::stoll(asString())`)
+  - `read_request_bundle` (native JSON int)
+  - `read_threads_running` (native JSON int)
+  - `read_threads_total` (native JSON int)
+- Added new `rippled_server_info` Int64ObservableGauge with 8 metrics:
+  - `server_state` — operating mode as int (0=DISCONNECTED .. 4=FULL)
+  - `uptime` — seconds since server start
+  - `peers` — total peer count
+  - `validated_ledger_seq` — validated ledger sequence (atomic read)
+  - `ledger_current_index` — current open ledger sequence
+  - `peer_disconnects_resources` — cumulative resource-related disconnects
+  - `last_close_proposers` — from `getConsensusInfo()["previous_proposers"]`
+  - `last_close_converge_time_ms` — from `getConsensusInfo()["previous_mseconds"]`
+- Added new `rippled_build_info` Int64ObservableGauge (info-style, value=1 with `version` label)
+- Added new `rippled_complete_ledgers` Int64ObservableGauge parsing comma-separated ranges into `{bound, index}` pairs
+- Added new `rippled_db_metrics` Int64ObservableGauge with 4 metrics:
+  - `db_kb_total`, `db_kb_ledger`, `db_kb_transaction` (SQLite stat queries)
+  - `historical_perminute` (historical ledger fetch rate)
+
+**Key modified files**:
+
+- `src/xrpld/telemetry/MetricsRegistry.h` (4 new gauge members, updated ASCII diagram)
+- `src/xrpld/telemetry/MetricsRegistry.cpp` (4 new callback registrations, 2 callback extensions)
+
+**Not implementable inside rippled**:
+
+- `connection_count_51233/51234` — OS-level port connection counts from external shell script (`get_connection.sh`)
+
+**Derived Prometheus metrics**: `rippled_server_info{metric="server_state"}`, `rippled_build_info{version="2.4.0"}`, `rippled_complete_ledgers{bound="start",index="0"}`, `rippled_db_metrics{metric="db_kb_total"}`, etc.
+
+**Grafana dashboard**: New panels added to _Node Health_ dashboard (`system-node-health.json`).
+
+---
+
 ## Task 9.8: New Grafana Dashboards
 
 **Objective**: Create Grafana dashboards for the new metric categories.
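Reviewer note on Task 9.7a: the task list says `rippled_complete_ledgers` is produced by parsing comma-separated ranges into `{bound, index}` pairs, but the C++ callback itself is not part of this diff. The Python sketch below illustrates one plausible reading of that parsing step and is not the actual `MetricsRegistry.cpp` code. The function name `parse_complete_ledgers` is hypothetical, and the handling of the literal string `empty` and of single-ledger ranges without a dash are assumptions about the input format.

```python
from typing import List, Tuple


def parse_complete_ledgers(ranges: str) -> List[Tuple[str, int, int]]:
    """Turn "100-200,300-450" into (bound, index, sequence) data points."""
    text = ranges.strip()
    # Assumption: a node with no stored ledgers reports the literal "empty".
    if not text or text == "empty":
        return []
    points: List[Tuple[str, int, int]] = []
    for index, part in enumerate(text.split(",")):
        first, dash, last = part.strip().partition("-")
        start = int(first)
        # Assumption: a range holding a single ledger may appear without a dash.
        end = int(last) if dash else start
        points.append(("start", index, start))
        points.append(("end", index, end))
    return points


if __name__ == "__main__":
    # Print the points in Prometheus exposition style for illustration only.
    for bound, index, seq in parse_complete_ledgers("100-200,300-450"):
        print(f'rippled_complete_ledgers{{bound="{bound}",index="{index}"}} {seq}')
```

Each contiguous range yields two data points that share an `index` label, which matches the `bound="start"` and `bound="end"` rows in the data collection reference. One consequence worth checking in review: the number of time series grows with the number of gaps in ledger history.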
diff --git a/cspell.config.yaml b/cspell.config.yaml index d609a50e98..c8faea67c5 100644 --- a/cspell.config.yaml +++ b/cspell.config.yaml @@ -331,3 +331,5 @@ words: - xxhasher - xychart - zpages + - ripplex + - mseconds diff --git a/docker/telemetry/grafana/dashboards/system-node-health.json b/docker/telemetry/grafana/dashboards/system-node-health.json index 546a5f12a2..396c89a774 100644 --- a/docker/telemetry/grafana/dashboards/system-node-health.json +++ b/docker/telemetry/grafana/dashboards/system-node-health.json @@ -720,6 +720,425 @@ }, "overrides": [] } + }, + { + "title": "--- OTel: Server Info ---", + "type": "row", + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 59 + }, + "collapsed": false, + "panels": [] + }, + { + "title": "Server State", + "description": "Current operating mode: 0=DISCONNECTED, 1=CONNECTED, 2=SYNCING, 3=TRACKING, 4=FULL. Sourced from MetricsRegistry server_info observable gauge.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 6, + "x": 0, + "y": 60 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"server_state\"}", + "legendFormat": "State [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "mappings": [ + { + "type": "value", + "options": { "0": { "text": "DISCONNECTED", "color": "red" } } + }, + { + "type": "value", + "options": { "1": { "text": "CONNECTED", "color": "orange" } } + }, + { + "type": "value", + "options": { "2": { "text": "SYNCING", "color": "yellow" } } + }, + { + "type": "value", + "options": { "3": { "text": "TRACKING", "color": "blue" } } + }, + { + "type": "value", + "options": { "4": { "text": "FULL", "color": "green" } } + } + ], + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Uptime", + "description": "Time since server started, in seconds. 
Sourced from MetricsRegistry server_info observable gauge via UptimeClock.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 6, + "x": 6, + "y": 60 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"uptime\"}", + "legendFormat": "Uptime [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Peer Count", + "description": "Total connected peers (inbound + outbound). Sourced from MetricsRegistry server_info observable gauge via overlay().size().", + "type": "stat", + "gridPos": { + "h": 8, + "w": 6, + "x": 12, + "y": 60 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"peers\"}", + "legendFormat": "Peers [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Validated Ledger Seq", + "description": "Sequence number of the most recently validated ledger. Returns 0 before first validation. Sourced from MetricsRegistry server_info observable gauge.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 6, + "x": 18, + "y": 60 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"validated_ledger_seq\"}", + "legendFormat": "Seq [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Last Close Info", + "description": "Proposers and convergence time from the last closed consensus round. 
Sourced from MetricsRegistry server_info observable gauge via getConsensusInfo().", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 68 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"last_close_proposers\"}", + "legendFormat": "Proposers [{{exported_instance}}]" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"last_close_converge_time_ms\"}", + "legendFormat": "Converge Time ms [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": { + "drawStyle": "line", + "lineWidth": 2, + "fillOpacity": 10 + }, + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + } + }, + { + "title": "Build Version", + "description": "Build version info metric. Value is always 1; version string is in the 'version' label. Sourced from MetricsRegistry build_info observable gauge.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 68 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + }, + "textMode": "name" + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_build_info{exported_instance=~\"$node\"}", + "legendFormat": "v{{version}} [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "--- OTel: Complete Ledgers & DB ---", + "type": "row", + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 76 + }, + "collapsed": false, + "panels": [] + }, + { + "title": "Complete Ledger Ranges", + "description": "Start and end of each contiguous complete ledger range. Parsed from getLedgerMaster().getCompleteLedgers() string. 
Sourced from MetricsRegistry complete_ledgers observable gauge.", + "type": "table", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 77 + }, + "options": { + "showHeader": true + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_complete_ledgers{exported_instance=~\"$node\"}", + "legendFormat": "{{bound}} [range {{index}}] [{{exported_instance}}]", + "format": "table", + "instant": true + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Database Sizes", + "description": "SQLite database sizes in KB (total, ledger, transaction). Sourced from MetricsRegistry db_metrics observable gauge via getRelationalDatabase().getKBUsed*().", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 77 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_db_metrics{exported_instance=~\"$node\", metric=\"db_kb_total\"}", + "legendFormat": "Total KB [{{exported_instance}}]" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_db_metrics{exported_instance=~\"$node\", metric=\"db_kb_ledger\"}", + "legendFormat": "Ledger KB [{{exported_instance}}]" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_db_metrics{exported_instance=~\"$node\", metric=\"db_kb_transaction\"}", + "legendFormat": "Transaction KB [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "deckbytes", + "custom": { + "axisLabel": "Size (KB)", + "drawStyle": "line", + "lineWidth": 2, + "fillOpacity": 10 + }, + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + } + }, + { + "title": "Historical Fetch Rate", + "description": "Historical ledger fetches per minute. 
Sourced from MetricsRegistry db_metrics observable gauge via getInboundLedgers().fetchRate().", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 85 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_db_metrics{exported_instance=~\"$node\", metric=\"historical_perminute\"}", + "legendFormat": "Fetches/min [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": {} + }, + "overrides": [] + } + }, + { + "title": "Peer Disconnects (Resources)", + "description": "Cumulative count of peer disconnections due to resource limits. Sourced from MetricsRegistry server_info observable gauge via overlay().getPeerDisconnectCharges().", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 85 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_server_info{exported_instance=~\"$node\", metric=\"peer_disconnects_resources\"}", + "legendFormat": "Resource Disconnects [{{exported_instance}}]" + } + ], + "fieldConfig": { + "defaults": { + "unit": "none", + "custom": { + "drawStyle": "line", + "lineWidth": 2, + "fillOpacity": 10 + }, + "color": { + "mode": "palette-classic" + } + }, + "overrides": [] + } } ], "schemaVersion": 39, diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index 7334df4168..8c449d3b6e 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -209,6 +209,32 @@ When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yam | `rippled_{category}_Bytes_In/Out` | OverlayImpl.h:535 | Overlay traffic bytes per category (57 categories) | | `rippled_{category}_Messages_In/Out` | OverlayImpl.h:535 | Overlay traffic messages per category | +#### OTel MetricsRegistry Gauges (Phase 9) + +These gauges are exported via 
the OTel Metrics SDK `PeriodicMetricReader` (10s interval), NOT through beast::insight.
+
+| Prometheus Metric | Source | Description |
+| ----------------------------------------------------------- | ------------------- | -------------------------------------------- |
+| `rippled_server_info{metric="server_state"}` | MetricsRegistry.cpp | Operating mode (0=DISCONNECTED .. 4=FULL) |
+| `rippled_server_info{metric="uptime"}` | MetricsRegistry.cpp | Seconds since server start |
+| `rippled_server_info{metric="peers"}` | MetricsRegistry.cpp | Total connected peers |
+| `rippled_server_info{metric="validated_ledger_seq"}` | MetricsRegistry.cpp | Validated ledger sequence number |
+| `rippled_server_info{metric="ledger_current_index"}` | MetricsRegistry.cpp | Current open ledger sequence |
+| `rippled_server_info{metric="peer_disconnects_resources"}` | MetricsRegistry.cpp | Cumulative resource-related peer disconnects |
+| `rippled_server_info{metric="last_close_proposers"}` | MetricsRegistry.cpp | Proposers in last closed round |
+| `rippled_server_info{metric="last_close_converge_time_ms"}` | MetricsRegistry.cpp | Last close convergence time (ms) |
+| `rippled_build_info{version=""}` | MetricsRegistry.cpp | Info-style metric (always 1) |
+| `rippled_complete_ledgers{bound="start\|end",index=""}` | MetricsRegistry.cpp | Complete ledger range start/end pairs |
+| `rippled_db_metrics{metric="db_kb_total"}` | MetricsRegistry.cpp | Total database size (KB) |
+| `rippled_db_metrics{metric="db_kb_ledger"}` | MetricsRegistry.cpp | Ledger database size (KB) |
+| `rippled_db_metrics{metric="db_kb_transaction"}` | MetricsRegistry.cpp | Transaction database size (KB) |
+| `rippled_db_metrics{metric="historical_perminute"}` | MetricsRegistry.cpp | Historical ledger fetches per minute |
+| `rippled_cache_metrics{metric="AL_size"}` | MetricsRegistry.cpp | AcceptedLedger cache size |
+| `rippled_nodestore_state{metric="node_reads_duration_us"}` | MetricsRegistry.cpp | Cumulative read time (microseconds) |
+| `rippled_nodestore_state{metric="read_request_bundle"}` | MetricsRegistry.cpp | Read request bundle count |
+| `rippled_nodestore_state{metric="read_threads_running"}` | MetricsRegistry.cpp | Active read threads |
+| `rippled_nodestore_state{metric="read_threads_total"}` | MetricsRegistry.cpp | Total read threads configured |
+
 #### Counters
 
 | Prometheus Metric | Source | Description |
@@ -300,16 +326,24 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
 
 ### Node Health — System Metrics (`rippled-system-node-health`)
 
-| Panel | Type | PromQL | Labels Used |
-| -------------------------- | ---------- | ------------------------------------------------------ | ----------- |
-| Validated Ledger Age | stat | `rippled_LedgerMaster_Validated_Ledger_Age` | — |
-| Published Ledger Age | stat | `rippled_LedgerMaster_Published_Ledger_Age` | — |
-| Operating Mode Duration | timeseries | `rippled_State_Accounting_*_duration` | — |
-| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions` | — |
-| I/O Latency | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)` | — |
-| Job Queue Depth | timeseries | `rippled_job_count` | — |
-| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — |
-| Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — |
+| Panel | Type | PromQL | Labels Used |
+| -------------------------- | ---------- | ------------------------------------------------------ | ---------------- |
+| Validated Ledger Age | stat | `rippled_LedgerMaster_Validated_Ledger_Age` | — |
+| Published Ledger Age | stat | `rippled_LedgerMaster_Published_Ledger_Age` | — |
+| Operating Mode Duration | timeseries | `rippled_State_Accounting_*_duration` | — |
+| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions` | — |
+| I/O Latency | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)` | — |
+| Job Queue Depth | timeseries | `rippled_job_count` | — |
+| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — |
+| Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — |
+| Server State | stat | `rippled_server_info{metric="server_state"}` | `metric` |
+| Uptime | stat | `rippled_server_info{metric="uptime"}` | `metric` |
+| Peer Count | stat | `rippled_server_info{metric="peers"}` | `metric` |
+| Validated Ledger Seq | stat | `rippled_server_info{metric="validated_ledger_seq"}` | `metric` |
+| Build Version | stat | `rippled_build_info` | `version` |
+| Complete Ledger Ranges | table | `rippled_complete_ledgers` | `bound`, `index` |
+| Database Sizes | timeseries | `rippled_db_metrics{metric=~"db_kb_.*"}` | `metric` |
+| Historical Fetch Rate | stat | `rippled_db_metrics{metric="historical_perminute"}` | `metric` |
 
 ### Network Traffic — System Metrics (`rippled-system-network`)
 
@@ -420,6 +454,17 @@ count_over_time({job="rippled"} |= "trace_id=" [5m])
 4. Check that the `otlp` receiver is in the metrics pipeline receivers in `otel-collector-config.yaml`
 5. Query Prometheus directly: `curl 'http://localhost:9090/api/v1/query?query=rippled_job_count'`
 
+### Server info gauge shows server_state=0
+
+This is normal during startup. The server starts in DISCONNECTED mode (0) and
+progresses through CONNECTED (1), SYNCING (2), TRACKING (3), to FULL (4).
+Wait for the node to sync with the network.
+
+### Database metrics showing zero
+
+The `getKBUsed*()` methods require SQLite databases to exist. If running with
+`--standalone` or before the first ledger is stored, these will be zero.
+
 ### High memory usage
 
 - Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)