diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index 4101f74771..0af035b404 100644 --- a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -556,6 +556,8 @@ span->SetAttribute("peer.id", peerId); ### 2.6.4 Coexistence Strategy +> **Note**: Phase 7 replaces the StatsD bridge with native OTel Metrics SDK export. The diagram below shows the Phase 6 intermediate state. See [Phase7_taskList.md](./Phase7_taskList.md) for the migration design where Beast Insight emits via OTLP instead of StatsD. + ```mermaid flowchart TB subgraph rippled["rippled Process"] @@ -584,6 +586,8 @@ flowchart TB - **OpenTelemetry to OTLP Collector**: OTel exports spans over OTLP/gRPC to a Collector, which then forwards to a trace backend (Tempo). - **Grafana (red, unified UI)**: All three data streams converge in Grafana, enabling operators to correlate logs, metrics, and traces in a single dashboard. +**Phase 7 target state**: Beast Insight routes to `OTelCollector` (a new `Collector` implementation), which exports via OTLP/HTTP to the same collector endpoint as traces. The StatsD UDP path becomes a deprecated fallback (`[insight] server=statsd`). See [06-implementation-phases.md §6.8](./06-implementation-phases.md) and [Phase7_taskList.md](./Phase7_taskList.md) for details. 
+ ### 2.6.5 Correlation with PerfLog Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging: diff --git a/OpenTelemetryPlan/05-configuration-reference.md b/OpenTelemetryPlan/05-configuration-reference.md index 5d8e0cd105..a69e60cb5e 100644 --- a/OpenTelemetryPlan/05-configuration-reference.md +++ b/OpenTelemetryPlan/05-configuration-reference.md @@ -921,18 +921,22 @@ jsonData: filterBySpanID: false ``` -### 5.8.7 Correlation with Insight/StatsD Metrics +### 5.8.7 Correlation with Insight/OTel System Metrics -To correlate traces with existing Beast Insight metrics: +To correlate traces with Beast Insight system metrics: **Step 1: Export Insight metrics to Prometheus** -```yaml -# prometheus.yaml -scrape_configs: - - job_name: "rippled-statsd" - static_configs: - - targets: ["statsd-exporter:9102"] +Beast Insight metrics are exported natively via OTLP to the OTel Collector, +which exposes them on the Prometheus endpoint alongside spanmetrics. No +separate StatsD exporter is needed when using `server=otel`. 
+ +```ini +# xrpld.cfg — native OTel metrics (recommended) +[insight] +server=otel +endpoint=http://localhost:4318/v1/metrics +prefix=rippled ``` **Step 2: Add exemplars to metrics** diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 643aa29392..65f9b45577 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -355,7 +355,187 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi - [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`) - [ ] All 3 new Grafana dashboards load without errors - [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests) -- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately) +- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping) + +--- + +## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12) + +**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline. + +### Motivation: Why Migrate from StatsD to Native OTel Metrics + +The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves. + +#### What We Gain + +1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." + +2. 
**Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, fragmentation of datagrams larger than the 1472-byte payload that fits a 1500-byte Ethernet MTU, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. + +3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses the non-standard `|m` StatsD type, which the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1, previously DEFERRED, is resolved). + +4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. + +5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. + +6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. + +7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. + +#### What We Lose + +1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. + +2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it adds moving parts. + +3. 
**Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. + +4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. + +#### Decision + +The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. + +### Architecture + +#### Class Hierarchy (after Phase 7) + +``` +beast::insight::Collector (abstract interface — unchanged) + | + +-- StatsDCollector (existing — retained as fallback, deprecated) + | +-- StatsDCounterImpl -> StatsD |c over UDP + | +-- StatsDGaugeImpl -> StatsD |g over UDP + | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) + | +-- StatsDEventImpl -> StatsD |ms over UDP + | +-- StatsDHookImpl -> 1s periodic callback + | + +-- NullCollector (existing — unchanged, used when disabled) + | +-- NullCounterImpl -> no-op + | +-- NullGaugeImpl -> no-op + | +-- NullMeterImpl -> no-op + | +-- NullEventImpl -> no-op + | +-- NullHookImpl -> no-op + | + +-- OTelCollector (NEW — Phase 7) + +-- OTelCounterImpl -> otel::Counter + +-- OTelGaugeImpl -> otel::ObservableGauge + +-- OTelMeterImpl -> otel::Counter + +-- OTelEventImpl -> otel::Histogram + +-- OTelHookImpl -> 1s periodic callback (same pattern) +``` + +#### Data Flow (after Phase 7) + +```mermaid +graph LR + subgraph rippledNode["rippled Node"] + A["Trace Macros
XRPL_TRACE_SPAN"] + B["beast::insight
OTelCollector"] + end + + subgraph collector["OTel Collector :4317 / :4318"] + direction TB + R1["OTLP Receiver
:4317 gRPC | :4318 HTTP"] + BP["Batch Processor"] + SM["SpanMetrics Connector"] + + R1 --> BP + BP --> SM + end + + subgraph backends["Trace Backends"] + D["Jaeger / Tempo"] + end + + subgraph metrics["Metrics Stack"] + E["Prometheus :9090
scrapes :8889
span-derived + native OTel metrics"] + end + + subgraph viz["Visualization"] + F["Grafana :3000"] + end + + A -->|"OTLP/HTTP :4318
(traces)"| R1 + B -->|"OTLP/HTTP :4318
(metrics)"| R1 + + BP -->|"OTLP/gRPC"| D + SM -->|"RED metrics"| E + R1 -->|"rippled_* metrics
(native OTLP)"| E + + E --> F + D --> F + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style B fill:#d9534f,color:#fff,stroke:#b52d2d + style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style SM fill:#449d44,color:#fff,stroke:#2d6e2d + style D fill:#f0ad4e,color:#000,stroke:#c78c2e + style E fill:#f0ad4e,color:#000,stroke:#c78c2e + style F fill:#5bc0de,color:#000,stroke:#3aa8c1 + style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c + style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e + style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e + style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de +``` + +**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. + +#### Configuration + +```ini +# [insight] section — new "otel" server option +[insight] +server=otel # NEW: uses OTel OTLP metrics exporter +prefix=rippled # metric name prefix (preserved) + +# Endpoint and auth inherited from [telemetry] section: +[telemetry] +enabled=1 +endpoint=http://localhost:4318/v1/traces +``` + +The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. + +**Backward compatibility**: `server=statsd` continues to work exactly as before. + +See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. 
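The endpoint derivation described in the Configuration section (swapping `/v1/traces` for `/v1/metrics`) can be sketched as a small string helper. This is a minimal illustration only: the function name `deriveMetricsEndpoint` is hypothetical and not part of the plan, and the choice to leave URLs without a trailing `/v1/traces` unchanged is an assumption.

```cpp
#include <string>

// Hypothetical helper (name and behavior are assumptions, not from the plan):
// derive the OTLP/HTTP metrics URL from the traces URL in [telemetry].
std::string
deriveMetricsEndpoint(std::string const& tracesEndpoint)
{
    static std::string const tracesPath = "/v1/traces";
    static std::string const metricsPath = "/v1/metrics";

    auto const n = tracesPath.size();

    // Rewrite only when the URL ends with the traces signal path.
    if (tracesEndpoint.size() >= n &&
        tracesEndpoint.compare(tracesEndpoint.size() - n, n, tracesPath) == 0)
    {
        return tracesEndpoint.substr(0, tracesEndpoint.size() - n) + metricsPath;
    }

    // Assumption: any other URL is passed through untouched.
    return tracesEndpoint;
}
```

With the sample config above, `deriveMetricsEndpoint("http://localhost:4318/v1/traces")` yields `http://localhost:4318/v1/metrics`. Whether a base URL without a signal path should have `/v1/metrics` appended is a design choice this sketch does not settle.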
+ +### Instrument Type Mapping + +| beast::insight | OTel Metrics SDK | Rationale | +| ---------------------- | -------------------------------- | ---------------------------------------------------------------- | +| Counter (int64, `\|c`) | `Counter` | Direct 1:1 mapping | +| Gauge (uint64, `\|g`) | `ObservableGauge` | Async callback matches existing Hook polling pattern | +| Meter (uint64, `\|m`) | `Counter` | Fixes non-standard wire format; meters are semantically counters | +| Event (ms, `\|ms`) | `Histogram` | Duration distributions with explicit bucket boundaries | +| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval | + +### Tasks + +| Task | Description | +| ---- | ------------------------------------------------------------------------- | +| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) | +| 7.2 | Implement `OTelCollector` class (~400-500 lines) | +| 7.3 | Update `CollectorManager` — add `server=otel` | +| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) | +| 7.5 | Preserve metric names in Prometheus (naming strategy) | +| 7.6 | Update Grafana dashboards (if names change) | +| 7.7 | Update integration tests | +| 7.8 | Update documentation (runbook, reference docs) | + +### Exit Criteria + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant --- @@ -636,14 +816,15 @@ Clear, measurable criteria for each phase. 
### 6.13.6 Success Metrics Summary - -| Phase | Primary Metric | Secondary Metric | Deadline | -| ------- | ---------------------- | --------------------------- | ------------- | -| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | -| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | -| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | -| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | -| Phase 5 | Production deployment | Operators trained | End of Week 9 | +| Phase | Primary Metric | Secondary Metric | Deadline | +| ------- | ---------------------------- | --------------------------- | -------------- | +| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | +| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | +| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | +| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | +| Phase 5 | Production deployment | Operators trained | End of Week 9 | +| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | +| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | --- diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md index 660c4f845d..73eac02583 100644 --- a/OpenTelemetryPlan/08-appendix.md +++ b/OpenTelemetryPlan/08-appendix.md @@ -195,6 +195,7 @@ flowchart TB | [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing | | [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing | | [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests | +| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration | | [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview | --- diff --git 
a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index fb91e676bc..2c13fe48b5 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -10,13 +10,12 @@ graph LR subgraph rippledNode["rippled Node"] A["Trace Macros
XRPL_TRACE_SPAN
(OTLP/HTTP exporter)"] - B["beast::insight
StatsD metrics
(UDP sender)"] + B["beast::insight
OTel native metrics
(OTLP/HTTP exporter)"] end - subgraph collector["OTel Collector :4317 / :4318 / :8125"] + subgraph collector["OTel Collector :4317 / :4318"] direction TB - R1["OTLP Receiver
:4317 gRPC | :4318 HTTP"] - R2["StatsD Receiver
:8125 UDP"] + R1["OTLP Receiver
:4317 gRPC | :4318 HTTP
(traces + metrics)"] BP["Batch Processor
timeout 1s, batch 100"] SM["SpanMetrics Connector
derives RED metrics
from trace spans"] @@ -30,7 +29,7 @@ graph LR end subgraph metrics["Metrics Stack"] - E["Prometheus :9090
scrapes :8889
span-derived + StatsD metrics"] + E["Prometheus :9090
scrapes :8889
span-derived + system metrics"] end subgraph viz["Visualization"] @@ -38,22 +37,21 @@ graph LR end A -->|"OTLP/HTTP :4318
(traces + attributes)"| R1 - B -->|"UDP :8125
(gauges, counters, timers)"| R2 + B -->|"OTLP/HTTP :4318
(gauges, counters, histograms)"| R1 BP -->|"OTLP/gRPC :4317"| D BP -->|"OTLP/gRPC"| T SM -->|"span_calls_total
span_duration_ms
(6 dimension labels)"| E - R2 -->|"rippled_* gauges
rippled_* counters
rippled_* summaries"| E + R1 -->|"rippled_* gauges
rippled_* counters
rippled_* histograms"| E E -->|"Prometheus
data source"| F D -->|"Jaeger
data source"| F T -->|"Tempo
data source"| F style A fill:#4a90d9,color:#fff,stroke:#2a6db5 - style B fill:#d9534f,color:#fff,stroke:#b52d2d + style B fill:#4a90d9,color:#fff,stroke:#2a6db5 style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d - style R2 fill:#5cb85c,color:#fff,stroke:#3d8b3d style BP fill:#449d44,color:#fff,stroke:#2d6e2d style SM fill:#449d44,color:#fff,stroke:#2d6e2d style D fill:#f0ad4e,color:#000,stroke:#c78c2e @@ -67,10 +65,10 @@ graph LR style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de ``` -There are two independent telemetry pipelines entering a single **OTel Collector**: +There are two independent telemetry pipelines entering a single **OTel Collector** via the same OTLP receiver: 1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline. -2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets to port :8125, ingested by the collector's **StatsD Receiver**, and exported alongside span-derived metrics to Prometheus. +2. **beast::insight OTel Metrics** — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same **OTLP Receiver**. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; `server=statsd` remains available as a fallback. **Trace backends** — The collector exports traces via OTLP/gRPC to one or both: @@ -268,14 +266,26 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er --- -## 2. StatsD Metrics (beast::insight) +## 2. 
System Metrics (beast::insight — OTel native) -> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory. +> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6/7 metric inventory. +> +> **Migration complete**: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changed. `[insight] server=statsd` remains as a fallback. -These are system-level metrics emitted by rippled's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans. +These are system-level metrics emitted by rippled's `beast::insight` framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans. 
### Configuration +```ini +# Recommended: native OTel metrics via OTLP/HTTP +[insight] +server=otel +endpoint=http://localhost:4318/v1/metrics +prefix=rippled +``` + +Fallback (StatsD): + ```ini [insight] server=statsd @@ -305,7 +315,7 @@ prefix=rippled | `rippled_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) | | `rippled_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) | -**Grafana dashboard**: _Node Health (StatsD)_ (`rippled-statsd-node-health`) +**Grafana dashboard**: _Node Health (System Metrics)_ (`rippled-system-node-health`) ### 2.2 Counters @@ -317,11 +327,11 @@ prefix=rippled | `rippled_warn` | Logic.h | Resource manager warnings issued | | `rippled_drop` | Logic.h | Resource manager drops (connections rejected) | -**Note**: `rippled_warn` and `rippled_drop` use non-standard StatsD meter type (`|m`). The OTel StatsD receiver only recognizes `|c`, `|g`, `|ms`, `|h`, `|s` — these metrics may be silently dropped. See Known Issues below. +**Note**: With `server=otel`, `rippled_warn` and `rippled_drop` are properly exported as OTel Counter instruments. The previous StatsD `|m` type limitation no longer applies. 
-**Grafana dashboard**: _RPC & Pathfinding (StatsD)_ (`rippled-statsd-rpc`) +**Grafana dashboard**: _RPC & Pathfinding (System Metrics)_ (`rippled-system-rpc`) -### 2.3 Histograms (from StatsD timers) +### 2.3 Histograms (Event timers) | Prometheus Metric | Source File | Unit | Description | | ----------------------- | ----------------- | ----- | ------------------------------ | @@ -361,7 +371,7 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo | `ping` / `status` | Keepalive and status | | `set_get` | Set requests | -**Grafana dashboards**: _Network Traffic_ (`rippled-statsd-network`), _Overlay Traffic Detail_ (`rippled-statsd-overlay-detail`), _Ledger Data & Sync_ (`rippled-statsd-ledger-sync`) +**Grafana dashboards**: _Network Traffic_ (`rippled-system-network`), _Overlay Traffic Detail_ (`rippled-system-overlay-detail`), _Ledger Data & Sync_ (`rippled-system-ledger-sync`) --- @@ -379,15 +389,15 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo | Ledger Operations | `rippled-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison | | Peer Network | `rippled-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown | -### 3.2 StatsD Dashboards (5) +### 3.2 System Metrics Dashboards (5) -| Dashboard | UID | Data Source | Key Panels | -| ---------------------- | ------------------------------- | ------------------- | --------------------------------------------------------------------------------- | -| Node Health | `rippled-statsd-node-health` | Prometheus (StatsD) | Ledger age, operating mode, I/O latency, job queue, fetch rate | -| Network Traffic | `rippled-statsd-network` | Prometheus (StatsD) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category | -| RPC & Pathfinding | `rippled-statsd-rpc` | Prometheus (StatsD) | RPC rate, response 
time/size, pathfinding duration, resource warnings/drops | -| Overlay Traffic Detail | `rippled-statsd-overlay-detail` | Prometheus (StatsD) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths | -| Ledger Data & Sync | `rippled-statsd-ledger-sync` | Prometheus (StatsD) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap | +| Dashboard | UID | Data Source | Key Panels | +| ---------------------- | ------------------------------- | ----------------- | --------------------------------------------------------------------------------- | +| Node Health | `rippled-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate | +| Network Traffic | `rippled-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category | +| RPC & Pathfinding | `rippled-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops | +| Overlay Traffic Detail | `rippled-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths | +| Ledger Data & Sync | `rippled-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap | ### 3.3 Accessing the Dashboards @@ -443,7 +453,7 @@ ledger.store (persist to DB) ## 5. Prometheus Query Examples -> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus StatsD metrics with trace-derived metrics. +> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus system metrics with trace-derived metrics. 
### Span-Derived Metrics diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md index bd79489b79..85fda4cdce 100644 --- a/OpenTelemetryPlan/OpenTelemetryPlan.md +++ b/OpenTelemetryPlan/OpenTelemetryPlan.md @@ -187,17 +187,19 @@ OpenTelemetry Collector configurations are provided for development and producti ## 6. Implementation Phases -The implementation spans 9 weeks across 5 phases: +The implementation spans 12 weeks across 7 phases: -| Phase | Duration | Focus | Key Deliverables | -| ----- | --------- | ------------------- | --------------------------------------------------- | -| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration | -| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation | -| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation | -| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing | -| 5 | Week 9 | Documentation | Runbook, Dashboards, Training | +| Phase | Duration | Focus | Key Deliverables | +| ----- | ----------- | --------------------- | ----------------------------------------------------------- | +| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration | +| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation | +| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation | +| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing | +| 5 | Week 9 | Documentation | Runbook, Dashboards, Training | +| 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards | +| 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation | -**Total Effort**: 47 person-days (2 developers working in parallel) +**Total Effort**: 60.6 developer-days with 2 developers ➡️ **[View full Implementation 
Phases](./06-implementation-phases.md)** diff --git a/OpenTelemetryPlan/Phase7_taskList.md b/OpenTelemetryPlan/Phase7_taskList.md new file mode 100644 index 0000000000..931235a8f4 --- /dev/null +++ b/OpenTelemetryPlan/Phase7_taskList.md @@ -0,0 +1,254 @@ +# Phase 7: Native OTel Metrics Migration — Task List + +> **Goal**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency. +> +> **Scope**: New `OTelCollectorImpl` class, `CollectorManager` config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests. +> +> **Branch**: `pratik/otel-phase7-native-metrics` (from `pratik/otel-phase6-statsd`) + +### Related Plan Documents + +| Document | Relevance | +| -------------------------------------------------------------------- | --------------------------------------------------------------- | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 7 plan: motivation, architecture, exit criteria (§6.8) | +| [02-design-decisions.md](./02-design-decisions.md) | Collector interface design, beast::insight coexistence strategy | +| [05-configuration-reference.md](./05-configuration-reference.md) | `[insight]` and `[telemetry]` config sections | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved | + +--- + +## Task 7.1: Add OTel Metrics SDK to Build Dependencies + +**Objective**: Enable the OTel C++ Metrics SDK components in the build system. 
+ +**What to do**: + +- Edit `conanfile.py`: + - Add OTel metrics SDK components to the dependency list when `telemetry=True` + - Components needed: `opentelemetry-cpp::metrics`, `opentelemetry-cpp::otlp_http_metric_exporter` + +- Edit `CMakeLists.txt` (telemetry section): + - Link `opentelemetry::metrics` and `opentelemetry::otlp_http_metric_exporter` targets + +**Key modified files**: + +- `conanfile.py` +- `CMakeLists.txt` (or the relevant telemetry cmake target) + +**Reference**: [05-configuration-reference.md §5.3](./05-configuration-reference.md) — CMake integration + +--- + +## Task 7.2: Implement OTelCollector Class + +**Objective**: Create the core `OTelCollector` implementation that maps beast::insight instruments to OTel Metrics SDK instruments. + +**What to do**: + +- Create `include/xrpl/beast/insight/OTelCollector.h`: + - Public factory: `static std::shared_ptr New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)` + - Derives from `StatsDCollector` (or directly from `Collector` — TBD based on shared code) + +- Create `src/libxrpl/beast/insight/OTelCollector.cpp` (~400-500 lines): + - **OTelCounterImpl**: Wraps `opentelemetry::metrics::Counter`. `increment(amount)` calls `counter->Add(amount)`. + - **OTelGaugeImpl**: Uses `opentelemetry::metrics::ObservableGauge` with an async callback. `set(value)` stores value atomically; callback reads it during collection. + - **OTelMeterImpl**: Wraps `opentelemetry::metrics::Counter`. `increment(amount)` calls `counter->Add(amount)`. Semantically identical to Counter but unsigned. + - **OTelEventImpl**: Wraps `opentelemetry::metrics::Histogram`. `notify(duration)` calls `histogram->Record(duration.count())`. Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms. + - **OTelHookImpl**: Stores handler function. Called during periodic metric collection (same 1s pattern via PeriodicMetricReader). + - **OTelCollectorImp**: Main class. 
+ - Creates `MeterProvider` with `PeriodicMetricReader` (1s export interval) + - Creates `OtlpHttpMetricExporter` pointing to `[telemetry]` endpoint + - Sets resource attributes (service.name, service.instance.id) matching trace exporter + - Implements all `make_*()` factory methods + - Prefixes metric names with `[insight] prefix=` value + +- Guard all OTel SDK includes with `#ifdef XRPL_ENABLE_TELEMETRY` to compile to `NullCollector` equivalents when telemetry disabled. + +**Key new files**: + +- `include/xrpl/beast/insight/OTelCollector.h` +- `src/libxrpl/beast/insight/OTelCollector.cpp` + +**Key patterns to follow**: + +- Match `StatsDCollector.cpp` structure: private impl classes, intrusive list for metrics, strand-based thread safety +- Match existing telemetry code style from `src/libxrpl/telemetry/Telemetry.cpp` +- Use RAII for MeterProvider lifecycle (shutdown on destructor) + +**Reference**: [04-code-samples.md](./04-code-samples.md) — code style and patterns + +--- + +## Task 7.3: Update CollectorManager + +**Objective**: Add `server=otel` config option to route metric creation to the new OTel backend. + +**What to do**: + +- Edit `src/xrpld/app/main/CollectorManager.cpp`: + - In the constructor, add a third branch after `server == "statsd"`: + ```cpp + else if (server == "otel") + { + // Read endpoint from [telemetry] section + auto const endpoint = get(telemetryParams, "endpoint", + "http://localhost:4318/v1/metrics"); + std::string const& prefix(get(params, "prefix")); + m_collector = beast::insight::OTelCollector::New( + endpoint, prefix, journal); + } + ``` + - This requires access to the `[telemetry]` config section — may need to pass it as a parameter or read from Application config. 
+ +- Edit `src/xrpld/app/main/CollectorManager.h`: + - Add `#include ` + +**Key modified files**: + +- `src/xrpld/app/main/CollectorManager.cpp` +- `src/xrpld/app/main/CollectorManager.h` + +--- + +## Task 7.4: Update OTel Collector Configuration + +**Objective**: Add a metrics pipeline to the OTLP receiver and remove the StatsD receiver dependency. + +**What to do**: + +- Edit `docker/telemetry/otel-collector-config.yaml`: + - Remove `statsd` receiver (no longer needed when `server=otel`) + - Add metrics pipeline under `service.pipelines`: + ```yaml + metrics: + receivers: [otlp, spanmetrics] + processors: [batch] + exporters: [prometheus] + ``` + - The OTLP receiver already listens on :4318 — it just needs to be added to the metrics pipeline receivers. + - Keep `spanmetrics` connector in the metrics pipeline so span-derived RED metrics continue working. + +- Edit `docker/telemetry/docker-compose.yml`: + - Remove UDP :8125 port mapping from otel-collector service + - Update rippled service config: change `[insight] server=statsd` to `server=otel` + +**Key modified files**: + +- `docker/telemetry/otel-collector-config.yaml` +- `docker/telemetry/docker-compose.yml` + +**Note**: Keep a commented-out `statsd` receiver block for operators who need backward compatibility. + +--- + +## Task 7.5: Preserve Metric Names in Prometheus + +**Objective**: Ensure existing Grafana dashboards continue working with identical metric names. 
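One way to keep names stable is to build every instrument name from the prefix, group, and name joined by underscores. The sketch below is illustrative only: `makeMetricName` is a hypothetical helper, and skipping empty parts (so that an ungrouped metric such as `rippled_job_count` comes out without a doubled separator) is an assumption rather than confirmed `StatsDCollector` behavior.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper (not from the plan): join the non-empty parts
// with underscores to form a Prometheus-style metric name.
std::string
makeMetricName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string result;
    for (auto const* part : {&prefix, &group, &name})
    {
        // Assumption: empty parts are skipped rather than emitted as "__".
        if (part->empty())
            continue;
        if (!result.empty())
            result += '_';
        result += *part;
    }
    return result;
}
```

Under these assumptions, `makeMetricName("rippled", "LedgerMaster", "Validated_Ledger_Age")` produces `rippled_LedgerMaster_Validated_Ledger_Age`. Any suffixes or normalization applied later by the Prometheus export path are outside this helper and would still need to be verified against the existing dashboards.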
+ +**What to do**: + +- In `OTelCollector.cpp`, construct OTel instrument names to match existing Prometheus metric names: + - beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `rippled_LedgerMaster_Validated_Ledger_Age` + - The prefix + group + name concatenation must produce the same string as `StatsDCollector`'s format + - Use underscores as separators (matching StatsD convention) + +- Verify in integration test that key Prometheus queries still return data: + - `rippled_LedgerMaster_Validated_Ledger_Age` + - `rippled_Peer_Finder_Active_Inbound_Peers` + - `rippled_rpc_requests` + +**Key consideration**: OTel Prometheus exporter may normalize metric names differently than StatsD receiver. Test this early (Task 7.2) and adjust naming strategy if needed. The OTel SDK's Prometheus exporter adds `_total` suffix to counters and converts dots to underscores — match existing conventions. + +--- + +## Task 7.6: Update Grafana Dashboards + +**Objective**: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences. + +**What to do**: + +- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed. +- If OTLP export produces different names (e.g., `_total` suffix on counters), update: + - `docker/telemetry/grafana/dashboards/statsd-node-health.json` + - `docker/telemetry/grafana/dashboards/statsd-network-traffic.json` + - `docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json` +- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced). + +**Key modified files**: + +- `docker/telemetry/grafana/dashboards/statsd-*.json` (3 files, conditionally) + +--- + +## Task 7.7: Update Integration Tests + +**Objective**: Verify the full OTLP metrics pipeline end-to-end. 
+ +**What to do**: + +- Edit `docker/telemetry/integration-test.sh`: + - Update test config to use `[insight] server=otel` + - Verify metrics arrive in Prometheus via OTLP (not StatsD) + - Add check that StatsD receiver is no longer required + - Preserve all existing metric presence checks + +**Key modified files**: + +- `docker/telemetry/integration-test.sh` + +--- + +## Task 7.8: Update Documentation + +**Objective**: Update all plan docs, runbook, and reference docs to reflect the migration. + +**What to do**: + +- Edit `docs/telemetry-runbook.md`: + - Update `[insight]` config examples to show `server=otel` + - Update troubleshooting section (no more StatsD UDP debugging) + +- Edit `OpenTelemetryPlan/09-data-collection-reference.md`: + - Update Data Flow Overview diagram (remove StatsD receiver) + - Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)" + - Update config examples + +- Edit `OpenTelemetryPlan/05-configuration-reference.md`: + - Add `server=otel` option to `[insight]` section docs + +- Edit `docker/telemetry/TESTING.md`: + - Update setup instructions to use `server=otel` + +**Key modified files**: + +- `docs/telemetry-runbook.md` +- `OpenTelemetryPlan/09-data-collection-reference.md` +- `OpenTelemetryPlan/05-configuration-reference.md` +- `docker/telemetry/TESTING.md` + +--- + +## Summary Table + +| Task | Description | New Files | Modified Files | Depends On | +| ---- | -------------------------------------- | --------- | -------------- | ---------- | +| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | — | +| 7.2 | Implement OTelCollector class | 2 | 0 | 7.1 | +| 7.3 | Update CollectorManager config routing | 0 | 2 | 7.2 | +| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 7.3 | +| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 7.2 | +| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 7.5 | +| 7.7 | Update integration tests | 0 | 1 | 7.4 | +| 7.8 | Update documentation | 0 | 4 | 7.6 | + +**Parallel 
work**: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6. + +**Exit Criteria** (from [06-implementation-phases.md §6.8](./06-implementation-phases.md)): + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant — Meter mapped to OTel Counter diff --git a/cmake/XrplCore.cmake b/cmake/XrplCore.cmake index 724a51622b..1cfcb082b9 100644 --- a/cmake/XrplCore.cmake +++ b/cmake/XrplCore.cmake @@ -78,6 +78,13 @@ include(target_link_modules) # Level 01 add_module(xrpl beast) target_link_libraries(xrpl.libxrpl.beast PUBLIC xrpl.imports.main) +# OTelCollector in beast/insight uses OTel Metrics SDK when telemetry is enabled. +if(telemetry) + target_link_libraries( + xrpl.libxrpl.beast + PUBLIC opentelemetry-cpp::opentelemetry-cpp + ) +endif() include(GitInfo) add_module(xrpl git) diff --git a/docker/telemetry/TESTING.md b/docker/telemetry/TESTING.md index 9b88429f68..72d5e4ee81 100644 --- a/docker/telemetry/TESTING.md +++ b/docker/telemetry/TESTING.md @@ -444,21 +444,21 @@ curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"r | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}' ``` -### StatsD Metrics (beast::insight) +### System Metrics (beast::insight via OTel native) -rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector -on port 8125. These appear in Prometheus alongside spanmetrics. 
+rippled's built-in `beast::insight` framework exports metrics natively via OTLP/HTTP to the OTel Collector +on port 4318 (same endpoint as traces). These appear in Prometheus alongside spanmetrics. Requires `[insight]` config in `xrpld.cfg`: ```ini [insight] -server=statsd -address=127.0.0.1:8125 +server=otel +endpoint=http://localhost:4318/v1/metrics prefix=rippled ``` -Verify StatsD metrics in Prometheus: +Verify system metrics in Prometheus: ```bash # Ledger age gauge @@ -477,7 +477,7 @@ curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq ' curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result' ``` -Key StatsD metrics (prefix `rippled_`): +Key system metrics (prefix `rippled_`): | Metric | Type | Source | | ------------------------------------- | --------- | ----------------------------------------- | @@ -514,11 +514,11 @@ Pre-configured dashboards (span-derived): - **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics - **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`) -Pre-configured dashboards (StatsD): +Pre-configured dashboards (system metrics): -- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue -- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category -- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings +- **Node Health (System Metrics)**: Validated/published ledger age, operating mode, I/O latency, job queue +- **Network Traffic (System Metrics)**: Peer counts, disconnects, overlay traffic by category +- **RPC & Pathfinding (System Metrics)**: RPC request rate/time/size, pathfinding duration, resource warnings Pre-configured datasources: @@ -575,7 +575,7 @@ Pre-configured datasources: service: pipelines: metrics: - receivers: [spanmetrics] + receivers: [otlp, spanmetrics] exporters: 
[prometheus] ``` 3. Verify Prometheus can reach collector: diff --git a/docker/telemetry/docker-compose.yml b/docker/telemetry/docker-compose.yml index 2dc6b8f9a3..be40c11773 100644 --- a/docker/telemetry/docker-compose.yml +++ b/docker/telemetry/docker-compose.yml @@ -26,10 +26,12 @@ services: command: ["--config=/etc/otel-collector-config.yaml"] ports: - "4317:4317" # OTLP gRPC - - "4318:4318" # OTLP HTTP - - "8125:8125/udp" # StatsD UDP (beast::insight metrics) - - "8889:8889" # Prometheus metrics (spanmetrics + statsd) + - "4318:4318" # OTLP HTTP (traces + native OTel metrics) + - "8889:8889" # Prometheus metrics (spanmetrics + OTLP) - "13133:13133" # Health check + # StatsD UDP port removed — beast::insight now uses native OTLP. + # Uncomment if using server=statsd fallback: + # - "8125:8125/udp" volumes: - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro depends_on: diff --git a/docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json b/docker/telemetry/grafana/dashboards/system-ledger-data-sync.json similarity index 65% rename from docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json rename to docker/telemetry/grafana/dashboards/system-ledger-data-sync.json index 502d78e7aa..67148abb63 100644 --- a/docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json +++ b/docker/telemetry/grafana/dashboards/system-ledger-data-sync.json @@ -2,7 +2,7 @@ "annotations": { "list": [] }, - "description": "Ledger data exchange and object fetch traffic from beast::insight StatsD. Covers ledger sync, node data retrieval, and transaction set exchange. Requires [insight] server=statsd in rippled config.", + "description": "Ledger data exchange and object fetch traffic from beast::insight System Metrics. Covers ledger sync, node data retrieval, and transaction set exchange. 
Requires [insight] server=otel in rippled config.", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 1, @@ -30,57 +30,57 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_get_Bytes_In", - "legendFormat": "Ledger Data Get" + "expr": "rippled_ledger_data_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Data Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_share_Bytes_In", - "legendFormat": "Ledger Data Share" + "expr": "rippled_ledger_data_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Data Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In", - "legendFormat": "TX Set Candidate Get" + "expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Set Candidate Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In", - "legendFormat": "TX Set Candidate Share" + "expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Set Candidate Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In", - "legendFormat": "TX Node Get" + "expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In", - "legendFormat": "TX Node Share" + "expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": 
"rippled_ledger_data_Account_State_Node_get_Bytes_In", - "legendFormat": "Account State Node Get" + "expr": "rippled_ledger_data_Account_State_Node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Node Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_data_Account_State_Node_share_Bytes_In", - "legendFormat": "Account State Node Share" + "expr": "rippled_ledger_data_Account_State_Node_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Node Share [{{exported_instance}}]" } ], "fieldConfig": { @@ -118,57 +118,57 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_share_Bytes_In", - "legendFormat": "Ledger Share In" + "expr": "rippled_ledger_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Share In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_get_Bytes_In", - "legendFormat": "Ledger Get In" + "expr": "rippled_ledger_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Get In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In", - "legendFormat": "TX Set Candidate Share" + "expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Set Candidate Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In", - "legendFormat": "TX Set Candidate Get" + "expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Set Candidate Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Transaction_node_share_Bytes_In", - "legendFormat": "TX Node Share" + "expr": "rippled_ledger_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}", + 
"legendFormat": "TX Node Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Transaction_node_get_Bytes_In", - "legendFormat": "TX Node Get" + "expr": "rippled_ledger_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Account_State_node_share_Bytes_In", - "legendFormat": "Account State Share" + "expr": "rippled_ledger_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ledger_Account_State_node_get_Bytes_In", - "legendFormat": "Account State Get" + "expr": "rippled_ledger_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Get [{{exported_instance}}]" } ], "fieldConfig": { @@ -206,57 +206,57 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Ledger_get_Bytes_In", - "legendFormat": "Ledger Get" + "expr": "rippled_getobject_Ledger_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Ledger_share_Bytes_In", - "legendFormat": "Ledger Share" + "expr": "rippled_getobject_Ledger_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_get_Bytes_In", - "legendFormat": "Transaction Get" + "expr": "rippled_getobject_Transaction_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Transaction Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_share_Bytes_In", - "legendFormat": "Transaction Share" + "expr": 
"rippled_getobject_Transaction_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Transaction Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_node_get_Bytes_In", - "legendFormat": "TX Node Get" + "expr": "rippled_getobject_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_node_share_Bytes_In", - "legendFormat": "TX Node Share" + "expr": "rippled_getobject_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Account_State_node_get_Bytes_In", - "legendFormat": "Account State Get" + "expr": "rippled_getobject_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Account_State_node_share_Bytes_In", - "legendFormat": "Account State Share" + "expr": "rippled_getobject_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Share [{{exported_instance}}]" } ], "fieldConfig": { @@ -294,50 +294,50 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_CAS_get_Bytes_In", - "legendFormat": "CAS Get" + "expr": "rippled_getobject_CAS_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "CAS Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_CAS_share_Bytes_In", - "legendFormat": "CAS Share" + "expr": "rippled_getobject_CAS_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "CAS Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Fetch_Pack_share_Bytes_In", - "legendFormat": 
"Fetch Pack Share" + "expr": "rippled_getobject_Fetch_Pack_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Fetch Pack Share [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Fetch_Pack_get_Bytes_In", - "legendFormat": "Fetch Pack Get" + "expr": "rippled_getobject_Fetch_Pack_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Fetch Pack Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transactions_get_Bytes_In", - "legendFormat": "Transactions Get" + "expr": "rippled_getobject_Transactions_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Transactions Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_get_Bytes_In", - "legendFormat": "Aggregate Get" + "expr": "rippled_getobject_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Aggregate Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_share_Bytes_In", - "legendFormat": "Aggregate Share" + "expr": "rippled_getobject_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Aggregate Share [{{exported_instance}}]" } ], "fieldConfig": { @@ -375,55 +375,55 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Ledger_get_Messages_In", - "legendFormat": "Ledger Get" + "expr": "rippled_getobject_Ledger_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Ledger Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_get_Messages_In", - "legendFormat": "Transaction Get" + "expr": "rippled_getobject_Transaction_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Transaction Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transaction_node_get_Messages_In", - "legendFormat": "TX Node Get" + "expr": 
"rippled_getobject_Transaction_node_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Node Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Account_State_node_get_Messages_In", - "legendFormat": "Account State Get" + "expr": "rippled_getobject_Account_State_node_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Account State Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_CAS_get_Messages_In", - "legendFormat": "CAS Get" + "expr": "rippled_getobject_CAS_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "CAS Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Fetch_Pack_get_Messages_In", - "legendFormat": "Fetch Pack Get" + "expr": "rippled_getobject_Fetch_Pack_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Fetch Pack Get [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_getobject_Transactions_get_Messages_In", - "legendFormat": "Transactions Get" + "expr": "rippled_getobject_Transactions_get_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Transactions Get [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages In", "spanNulls": true, @@ -463,8 +463,8 @@ "datasource": { "type": "prometheus" }, - "expr": "topk(20, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", - "legendFormat": "{{__name__}}" + "expr": "topk(20, {exported_instance=~\"$node\", __name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", + "legendFormat": "{{__name__}} [{{exported_instance}}]" } ], "fieldConfig": { @@ -495,12 +495,33 @@ "schemaVersion": 39, "tags": ["rippled", "statsd", "ledger", "sync", "telemetry"], "templating": { - "list": [] + "list": [ + { + "name": "node", + "label": "Node", + 
"description": "Filter by rippled node (service.instance.id)", + "type": "query", + "query": "label_values(rippled_ledger_data_get_Bytes_In, exported_instance)", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "includeAll": true, + "allValue": ".*", + "current": { + "text": "All", + "value": "$__all" + }, + "multi": true, + "refresh": 2, + "sort": 1 + } + ] }, "time": { "from": "now-1h", "to": "now" }, - "title": "Ledger Data & Sync (StatsD)", - "uid": "rippled-statsd-ledger-sync" + "title": "Ledger Data & Sync (System Metrics)", + "uid": "rippled-system-ledger-sync" } diff --git a/docker/telemetry/grafana/dashboards/statsd-network-traffic.json b/docker/telemetry/grafana/dashboards/system-network-traffic.json similarity index 80% rename from docker/telemetry/grafana/dashboards/statsd-network-traffic.json rename to docker/telemetry/grafana/dashboards/system-network-traffic.json index 8dc072ba23..82faa28476 100644 --- a/docker/telemetry/grafana/dashboards/statsd-network-traffic.json +++ b/docker/telemetry/grafana/dashboards/system-network-traffic.json @@ -2,7 +2,7 @@ "annotations": { "list": [] }, - "description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.", + "description": "Network traffic and peer metrics from beast::insight System Metrics. 
Requires [insight] server=otel in rippled config.", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 1, @@ -30,20 +30,20 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_Peer_Finder_Active_Inbound_Peers", - "legendFormat": "Inbound Peers" + "expr": "rippled_Peer_Finder_Active_Inbound_Peers{exported_instance=~\"$node\"}", + "legendFormat": "Inbound Peers [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_Peer_Finder_Active_Outbound_Peers", - "legendFormat": "Outbound Peers" + "expr": "rippled_Peer_Finder_Active_Outbound_Peers{exported_instance=~\"$node\"}", + "legendFormat": "Outbound Peers [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Peers", "spanNulls": true, @@ -76,13 +76,13 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_Overlay_Peer_Disconnects", - "legendFormat": "Disconnects" + "expr": "rippled_Overlay_Peer_Disconnects{exported_instance=~\"$node\"}", + "legendFormat": "Disconnects [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Disconnects", "spanNulls": true, @@ -115,15 +115,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_total_Bytes_In", - "legendFormat": "Bytes In" + "expr": "rippled_total_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_total_Bytes_Out", - "legendFormat": "Bytes Out" + "expr": "rippled_total_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Bytes Out [{{exported_instance}}]" } ], "fieldConfig": { @@ -161,20 +161,20 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_total_Messages_In", - "legendFormat": "Messages In" + "expr": "rippled_total_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Messages In [{{exported_instance}}]" }, { "datasource": { 
"type": "prometheus" }, - "expr": "rippled_total_Messages_Out", - "legendFormat": "Messages Out" + "expr": "rippled_total_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Messages Out [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -207,27 +207,27 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_transactions_Messages_In", - "legendFormat": "TX Messages In" + "expr": "rippled_transactions_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Messages In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_transactions_Messages_Out", - "legendFormat": "TX Messages Out" + "expr": "rippled_transactions_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "TX Messages Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_transactions_duplicate_Messages_In", - "legendFormat": "TX Duplicate In" + "expr": "rippled_transactions_duplicate_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "TX Duplicate In [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -260,34 +260,34 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_proposals_Messages_In", - "legendFormat": "Proposals In" + "expr": "rippled_proposals_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Proposals In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proposals_Messages_Out", - "legendFormat": "Proposals Out" + "expr": "rippled_proposals_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Proposals Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proposals_untrusted_Messages_In", - "legendFormat": "Untrusted In" + "expr": 
"rippled_proposals_untrusted_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Untrusted In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proposals_duplicate_Messages_In", - "legendFormat": "Duplicate In" + "expr": "rippled_proposals_duplicate_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Duplicate In [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -320,34 +320,34 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_validations_Messages_In", - "legendFormat": "Validations In" + "expr": "rippled_validations_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Validations In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_validations_Messages_Out", - "legendFormat": "Validations Out" + "expr": "rippled_validations_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Validations Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_validations_untrusted_Messages_In", - "legendFormat": "Untrusted In" + "expr": "rippled_validations_untrusted_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Untrusted In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_validations_duplicate_Messages_In", - "legendFormat": "Duplicate In" + "expr": "rippled_validations_duplicate_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Duplicate In [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -380,8 +380,8 @@ "datasource": { "type": "prometheus" }, - "expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", - "legendFormat": "{{__name__}}" + "expr": "topk(10, {exported_instance=~\"$node\", 
__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", + "legendFormat": "{{__name__}} [{{exported_instance}}]" } ], "fieldConfig": { @@ -660,12 +660,33 @@ "schemaVersion": 39, "tags": ["rippled", "statsd", "network", "telemetry"], "templating": { - "list": [] + "list": [ + { + "name": "node", + "label": "Node", + "description": "Filter by rippled node (service.instance.id)", + "type": "query", + "query": "label_values(rippled_Peer_Finder_Active_Inbound_Peers, exported_instance)", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "includeAll": true, + "allValue": ".*", + "current": { + "text": "All", + "value": "$__all" + }, + "multi": true, + "refresh": 2, + "sort": 1 + } + ] }, "time": { "from": "now-1h", "to": "now" }, - "title": "Network Traffic (StatsD)", - "uid": "rippled-statsd-network" + "title": "Network Traffic (System Metrics)", + "uid": "rippled-system-network" } diff --git a/docker/telemetry/grafana/dashboards/statsd-node-health.json b/docker/telemetry/grafana/dashboards/system-node-health.json similarity index 71% rename from docker/telemetry/grafana/dashboards/statsd-node-health.json rename to docker/telemetry/grafana/dashboards/system-node-health.json index 215187f382..456c62b2e1 100644 --- a/docker/telemetry/grafana/dashboards/statsd-node-health.json +++ b/docker/telemetry/grafana/dashboards/system-node-health.json @@ -2,7 +2,7 @@ "annotations": { "list": [] }, - "description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.", + "description": "Node health metrics from beast::insight System Metrics. 
Requires [insight] server=otel in rippled config.", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 1, @@ -30,8 +30,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_LedgerMaster_Validated_Ledger_Age", - "legendFormat": "Validated Age" + "expr": "rippled_LedgerMaster_Validated_Ledger_Age{exported_instance=~\"$node\"}", + "legendFormat": "Validated Age [{{exported_instance}}]" } ], "fieldConfig": { @@ -78,8 +78,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_LedgerMaster_Published_Ledger_Age", - "legendFormat": "Published Age" + "expr": "rippled_LedgerMaster_Published_Ledger_Age{exported_instance=~\"$node\"}", + "legendFormat": "Published Age [{{exported_instance}}]" } ], "fieldConfig": { @@ -107,7 +107,7 @@ }, { "title": "Operating Mode Duration", - "description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). A healthy node should spend the vast majority of time in Full mode.", + "description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778) which report microseconds. 
A healthy node should spend the vast majority of time in Full mode.", "type": "timeseries", "gridPos": { "h": 8, @@ -126,43 +126,43 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Full_duration", - "legendFormat": "Full" + "expr": "rippled_State_Accounting_Full_duration{exported_instance=~\"$node\"}", + "legendFormat": "Full [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Tracking_duration", - "legendFormat": "Tracking" + "expr": "rippled_State_Accounting_Tracking_duration{exported_instance=~\"$node\"}", + "legendFormat": "Tracking [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Syncing_duration", - "legendFormat": "Syncing" + "expr": "rippled_State_Accounting_Syncing_duration{exported_instance=~\"$node\"}", + "legendFormat": "Syncing [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Connected_duration", - "legendFormat": "Connected" + "expr": "rippled_State_Accounting_Connected_duration{exported_instance=~\"$node\"}", + "legendFormat": "Connected [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Disconnected_duration", - "legendFormat": "Disconnected" + "expr": "rippled_State_Accounting_Disconnected_duration{exported_instance=~\"$node\"}", + "legendFormat": "Disconnected [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "s", + "unit": "µs", "custom": { - "axisLabel": "Duration (Sec)", + "axisLabel": "Duration", "spanNulls": true, "insertNulls": false, "showPoints": "auto", @@ -193,41 +193,41 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Full_transitions", - "legendFormat": "Full" + "expr": "rippled_State_Accounting_Full_transitions{exported_instance=~\"$node\"}", + "legendFormat": "Full [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - 
"expr": "rippled_State_Accounting_Tracking_transitions", - "legendFormat": "Tracking" + "expr": "rippled_State_Accounting_Tracking_transitions{exported_instance=~\"$node\"}", + "legendFormat": "Tracking [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Syncing_transitions", - "legendFormat": "Syncing" + "expr": "rippled_State_Accounting_Syncing_transitions{exported_instance=~\"$node\"}", + "legendFormat": "Syncing [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Connected_transitions", - "legendFormat": "Connected" + "expr": "rippled_State_Accounting_Connected_transitions{exported_instance=~\"$node\"}", + "legendFormat": "Connected [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_State_Accounting_Disconnected_transitions", - "legendFormat": "Disconnected" + "expr": "rippled_State_Accounting_Disconnected_transitions{exported_instance=~\"$node\"}", + "legendFormat": "Disconnected [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Transitions", "spanNulls": true, @@ -260,15 +260,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_ios_latency{quantile=\"0.95\"}", - "legendFormat": "P95 I/O Latency" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_ios_latency_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 I/O Latency [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_ios_latency{quantile=\"0.5\"}", - "legendFormat": "P50 I/O Latency" + "expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(rippled_ios_latency_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 I/O Latency [{{exported_instance}}]" } ], "fieldConfig": { @@ -287,7 +287,7 @@ }, { "title": "Job Queue Depth", - "description": "Current number of jobs 
waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.", + "description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough — common during ledger replay or heavy RPC load.", "type": "timeseries", "gridPos": { "h": 8, @@ -306,13 +306,13 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_job_count", - "legendFormat": "Job Queue Depth" + "expr": "rippled_job_count{exported_instance=~\"$node\"}", + "legendFormat": "Job Queue Depth [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Jobs", "spanNulls": true, @@ -345,8 +345,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rate(rippled_ledger_fetches_total[5m])", - "legendFormat": "Fetches / Sec" + "expr": "rate(rippled_ledger_fetches_total{exported_instance=~\"$node\"}[5m])", + "legendFormat": "Fetches / Sec [{{exported_instance}}]" } ], "fieldConfig": { @@ -377,8 +377,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rate(rippled_ledger_history_mismatch_total[5m])", - "legendFormat": "Mismatches / Sec" + "expr": "rate(rippled_ledger_history_mismatch_total{exported_instance=~\"$node\"}[5m])", + "legendFormat": "Mismatches / Sec [{{exported_instance}}]" } ], "fieldConfig": { @@ -404,12 +404,33 @@ "schemaVersion": 39, "tags": ["rippled", "statsd", "node-health", "telemetry"], "templating": { - "list": [] + "list": [ + { + "name": "node", + "label": "Node", + "description": "Filter by rippled node (service.instance.id)", + "type": "query", + "query": "label_values(rippled_LedgerMaster_Validated_Ledger_Age, exported_instance)", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "includeAll": true, + "allValue": ".*", + "current": { + 
"text": "All", + "value": "$__all" + }, + "multi": true, + "refresh": 2, + "sort": 1 + } + ] }, "time": { "from": "now-1h", "to": "now" }, - "title": "Node Health (StatsD)", - "uid": "rippled-statsd-node-health" + "title": "Node Health (System Metrics)", + "uid": "rippled-system-node-health" } diff --git a/docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json b/docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json similarity index 66% rename from docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json rename to docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json index a09a2b5d17..5ff2fbf4af 100644 --- a/docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json +++ b/docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json @@ -2,7 +2,7 @@ "annotations": { "list": [] }, - "description": "Detailed overlay traffic breakdown for categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. Requires [insight] server=statsd in rippled config.", + "description": "Detailed overlay traffic breakdown for categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. 
Requires [insight] server=otel in rippled config.", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 1, @@ -30,48 +30,48 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_Messages_In", - "legendFormat": "Squelch In" + "expr": "rippled_squelch_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Squelch In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_Messages_Out", - "legendFormat": "Squelch Out" + "expr": "rippled_squelch_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Squelch Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_suppressed_Messages_In", - "legendFormat": "Suppressed In" + "expr": "rippled_squelch_suppressed_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Suppressed In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_suppressed_Messages_Out", - "legendFormat": "Suppressed Out" + "expr": "rippled_squelch_suppressed_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Suppressed Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_ignored_Messages_In", - "legendFormat": "Ignored In" + "expr": "rippled_squelch_ignored_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Ignored In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_squelch_ignored_Messages_Out", - "legendFormat": "Ignored Out" + "expr": "rippled_squelch_ignored_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Ignored Out [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -104,43 +104,43 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_Bytes_In", - "legendFormat": "Base Overhead In" + "expr": 
"rippled_overhead_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Base Overhead In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_Bytes_Out", - "legendFormat": "Base Overhead Out" + "expr": "rippled_overhead_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Base Overhead Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_cluster_Bytes_In", - "legendFormat": "Cluster In" + "expr": "rippled_overhead_cluster_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Cluster In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_cluster_Bytes_Out", - "legendFormat": "Cluster Out" + "expr": "rippled_overhead_cluster_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Cluster Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_manifest_Bytes_In", - "legendFormat": "Manifest In" + "expr": "rippled_overhead_manifest_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Manifest In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_overhead_manifest_Bytes_Out", - "legendFormat": "Manifest Out" + "expr": "rippled_overhead_manifest_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Manifest Out [{{exported_instance}}]" } ], "fieldConfig": { @@ -178,34 +178,34 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_validator_lists_Bytes_In", - "legendFormat": "Bytes In" + "expr": "rippled_validator_lists_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_validator_lists_Bytes_Out", - "legendFormat": "Bytes Out" + "expr": "rippled_validator_lists_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Bytes Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - 
"expr": "rippled_validator_lists_Messages_In", - "legendFormat": "Messages In" + "expr": "rippled_validator_lists_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Messages In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_validator_lists_Messages_Out", - "legendFormat": "Messages Out" + "expr": "rippled_validator_lists_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Messages Out [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Count", "spanNulls": true, @@ -255,29 +255,29 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_set_get_Bytes_In", - "legendFormat": "Set Get In" + "expr": "rippled_set_get_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Set Get In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_set_get_Bytes_Out", - "legendFormat": "Set Get Out" + "expr": "rippled_set_get_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Set Get Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_set_share_Bytes_In", - "legendFormat": "Set Share In" + "expr": "rippled_set_share_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Set Share In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_set_share_Bytes_Out", - "legendFormat": "Set Share Out" + "expr": "rippled_set_share_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Set Share Out [{{exported_instance}}]" } ], "fieldConfig": { @@ -315,34 +315,34 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_have_transactions_Messages_In", - "legendFormat": "Have TX In" + "expr": "rippled_have_transactions_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Have TX In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_have_transactions_Messages_Out", - 
"legendFormat": "Have TX Out" + "expr": "rippled_have_transactions_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Have TX Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_requested_transactions_Messages_In", - "legendFormat": "Requested TX In" + "expr": "rippled_requested_transactions_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Requested TX In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_requested_transactions_Messages_Out", - "legendFormat": "Requested TX Out" + "expr": "rippled_requested_transactions_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Requested TX Out [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { "axisLabel": "Messages", "spanNulls": true, @@ -375,34 +375,34 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_unknown_Bytes_In", - "legendFormat": "Unknown Bytes In" + "expr": "rippled_unknown_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Unknown Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_unknown_Bytes_Out", - "legendFormat": "Unknown Bytes Out" + "expr": "rippled_unknown_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Unknown Bytes Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_unknown_Messages_In", - "legendFormat": "Unknown Messages In" + "expr": "rippled_unknown_Messages_In{exported_instance=~\"$node\"}", + "legendFormat": "Unknown Messages In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_unknown_Messages_Out", - "legendFormat": "Unknown Messages Out" + "expr": "rippled_unknown_Messages_Out{exported_instance=~\"$node\"}", + "legendFormat": "Unknown Messages Out [{{exported_instance}}]" } ], "fieldConfig": { "defaults": { - "unit": "short", + "unit": "none", "custom": { 
"axisLabel": "Count", "spanNulls": true, @@ -452,29 +452,29 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_proof_path_request_Bytes_In", - "legendFormat": "Request Bytes In" + "expr": "rippled_proof_path_request_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Request Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proof_path_request_Bytes_Out", - "legendFormat": "Request Bytes Out" + "expr": "rippled_proof_path_request_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Request Bytes Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proof_path_response_Bytes_In", - "legendFormat": "Response Bytes In" + "expr": "rippled_proof_path_response_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Response Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_proof_path_response_Bytes_Out", - "legendFormat": "Response Bytes Out" + "expr": "rippled_proof_path_response_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Response Bytes Out [{{exported_instance}}]" } ], "fieldConfig": { @@ -512,29 +512,29 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_replay_delta_request_Bytes_In", - "legendFormat": "Request Bytes In" + "expr": "rippled_replay_delta_request_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Request Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_replay_delta_request_Bytes_Out", - "legendFormat": "Request Bytes Out" + "expr": "rippled_replay_delta_request_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Request Bytes Out [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_replay_delta_response_Bytes_In", - "legendFormat": "Response Bytes In" + "expr": "rippled_replay_delta_response_Bytes_In{exported_instance=~\"$node\"}", + "legendFormat": "Response 
Bytes In [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_replay_delta_response_Bytes_Out", - "legendFormat": "Response Bytes Out" + "expr": "rippled_replay_delta_response_Bytes_Out{exported_instance=~\"$node\"}", + "legendFormat": "Response Bytes Out [{{exported_instance}}]" } ], "fieldConfig": { @@ -555,12 +555,33 @@ "schemaVersion": 39, "tags": ["rippled", "statsd", "overlay", "network", "telemetry"], "templating": { - "list": [] + "list": [ + { + "name": "node", + "label": "Node", + "description": "Filter by rippled node (service.instance.id)", + "type": "query", + "query": "label_values(rippled_squelch_Messages_In, exported_instance)", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "includeAll": true, + "allValue": ".*", + "current": { + "text": "All", + "value": "$__all" + }, + "multi": true, + "refresh": 2, + "sort": 1 + } + ] }, "time": { "from": "now-1h", "to": "now" }, - "title": "Overlay Traffic Detail (StatsD)", - "uid": "rippled-statsd-overlay-detail" + "title": "Overlay Traffic Detail (System Metrics)", + "uid": "rippled-system-overlay-detail" } diff --git a/docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json b/docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json similarity index 69% rename from docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json rename to docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json index 10bf1575e3..5e631747dc 100644 --- a/docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json +++ b/docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json @@ -2,7 +2,7 @@ "annotations": { "list": [] }, - "description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.", + "description": "RPC and pathfinding metrics from beast::insight System Metrics. 
Requires [insight] server=otel in rippled config.", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 1, @@ -10,7 +10,7 @@ "links": [], "panels": [ { - "title": "RPC Request Rate (StatsD)", + "title": "RPC Request Rate (System Metrics)", "description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.", "type": "stat", "gridPos": { @@ -30,8 +30,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rate(rippled_rpc_requests_total[5m])", - "legendFormat": "Requests / Sec" + "expr": "rate(rippled_rpc_requests_total{exported_instance=~\"$node\"}[5m])", + "legendFormat": "Requests / Sec [{{exported_instance}}]" } ], "fieldConfig": { @@ -42,7 +42,7 @@ } }, { - "title": "RPC Response Time (StatsD)", + "title": "RPC Response Time (System Metrics)", "description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. 
Compare with span-based rpc.request duration.", "type": "timeseries", "gridPos": { @@ -62,15 +62,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.95\"}", - "legendFormat": "P95 Response Time" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 Response Time [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.5\"}", - "legendFormat": "P50 Response Time" + "expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 Response Time [{{exported_instance}}]" } ], "fieldConfig": { @@ -108,15 +108,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_size{quantile=\"0.95\"}", - "legendFormat": "P95 Response Size" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_size_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 Response Size [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_size{quantile=\"0.5\"}", - "legendFormat": "P50 Response Size" + "expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_size_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 Response Size [{{exported_instance}}]" } ], "fieldConfig": { @@ -154,29 +154,29 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.5\"}", - "legendFormat": "P50" + "expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.9\"}", - "legendFormat": "P90" + "expr": "histogram_quantile(0.9, sum by (le, exported_instance) 
(rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P90 [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.95\"}", - "legendFormat": "P95" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_rpc_time{quantile=\"0.99\"}", - "legendFormat": "P99" + "expr": "histogram_quantile(0.99, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P99 [{{exported_instance}}]" } ], "fieldConfig": { @@ -214,15 +214,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_pathfind_fast{quantile=\"0.95\"}", - "legendFormat": "P95 Fast Pathfind" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_pathfind_fast_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 Fast Pathfind [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_pathfind_fast{quantile=\"0.5\"}", - "legendFormat": "P50 Fast Pathfind" + "expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_pathfind_fast_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 Fast Pathfind [{{exported_instance}}]" } ], "fieldConfig": { @@ -260,15 +260,15 @@ "datasource": { "type": "prometheus" }, - "expr": "rippled_pathfind_full{quantile=\"0.95\"}", - "legendFormat": "P95 Full Pathfind" + "expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_pathfind_full_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P95 Full Pathfind [{{exported_instance}}]" }, { "datasource": { "type": "prometheus" }, - "expr": "rippled_pathfind_full{quantile=\"0.5\"}", - "legendFormat": "P50 Full Pathfind" + "expr": "histogram_quantile(0.5, sum by (le, 
exported_instance) (rate(rippled_pathfind_full_bucket{exported_instance=~\"$node\"}[5m])))", + "legendFormat": "P50 Full Pathfind [{{exported_instance}}]" } ], "fieldConfig": { @@ -287,7 +287,7 @@ }, { "title": "Resource Warnings Rate", - "description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", + "description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", "type": "stat", "gridPos": { "h": 8, @@ -306,8 +306,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rate(rippled_warn_total[5m])", - "legendFormat": "Warnings / Sec" + "expr": "rate(rippled_warn_total{exported_instance=~\"$node\"}[5m])", + "legendFormat": "Warnings / Sec [{{exported_instance}}]" } ], "fieldConfig": { @@ -335,7 +335,7 @@ }, { "title": "Resource Drops Rate", - "description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", + "description": "Rate of resource drop events from the Resource Manager.
Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", "type": "stat", "gridPos": { "h": 8, @@ -354,8 +354,8 @@ "datasource": { "type": "prometheus" }, - "expr": "rate(rippled_drop_total[5m])", - "legendFormat": "Drops / Sec" + "expr": "rate(rippled_drop_total{exported_instance=~\"$node\"}[5m])", + "legendFormat": "Drops / Sec [{{exported_instance}}]" } ], "fieldConfig": { @@ -385,12 +385,33 @@ "schemaVersion": 39, "tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"], "templating": { - "list": [] + "list": [ + { + "name": "node", + "label": "Node", + "description": "Filter by rippled node (service.instance.id)", + "type": "query", + "query": "label_values(rippled_rpc_requests_total, exported_instance)", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "includeAll": true, + "allValue": ".*", + "current": { + "text": "All", + "value": "$__all" + }, + "multi": true, + "refresh": 2, + "sort": 1 + } + ] }, "time": { "from": "now-1h", "to": "now" }, - "title": "RPC & Pathfinding (StatsD)", - "uid": "rippled-statsd-rpc" + "title": "RPC & Pathfinding (System Metrics)", + "uid": "rippled-system-rpc" } diff --git a/docker/telemetry/integration-test.sh b/docker/telemetry/integration-test.sh index 52d7706e40..07b30b9a73 100755 --- a/docker/telemetry/integration-test.sh +++ b/docker/telemetry/integration-test.sh @@ -312,8 +312,8 @@ trace_peer=1 trace_ledger=1 [insight] -server=statsd -address=127.0.0.1:8125 +server=otel +endpoint=http://localhost:4318/v1/metrics prefix=rippled [rpc_startup] @@ -539,42 +539,52 @@ else fi # --------------------------------------------------------------------------- -# Step 10b: Verify StatsD metrics in Prometheus +# Step 10b: Verify native
OTel metrics in Prometheus (beast::insight) # --------------------------------------------------------------------------- log "" -log "--- Phase 6: StatsD Metrics (beast::insight) ---" -log "Waiting 20s for StatsD aggregation + Prometheus scrape..." +log "--- Phase 7: Native OTel Metrics (beast::insight via OTLP) ---" +log "Waiting 20s for OTLP metric export + Prometheus scrape..." sleep 20 -check_statsd_metric() { +check_otel_metric() { local metric_name="$1" local result result=$(curl -sf "$PROM/api/v1/query?query=$metric_name" \ | jq '.data.result | length' 2>/dev/null || echo 0) if [ "$result" -gt 0 ]; then - ok "StatsD: $metric_name ($result series)" + ok "OTel: $metric_name ($result series)" else - fail "StatsD: $metric_name (0 series)" + fail "OTel: $metric_name (0 series)" fi } -# Node health gauges -check_statsd_metric "rippled_LedgerMaster_Validated_Ledger_Age" -check_statsd_metric "rippled_LedgerMaster_Published_Ledger_Age" -check_statsd_metric "rippled_job_count" +# Node health gauges (ObservableGauge — no _total suffix) +check_otel_metric "rippled_LedgerMaster_Validated_Ledger_Age" +check_otel_metric "rippled_LedgerMaster_Published_Ledger_Age" +check_otel_metric "rippled_job_count" # State accounting -check_statsd_metric "rippled_State_Accounting_Full_duration" +check_otel_metric "rippled_State_Accounting_Full_duration" # Peer finder -check_statsd_metric "rippled_Peer_Finder_Active_Inbound_Peers" -check_statsd_metric "rippled_Peer_Finder_Active_Outbound_Peers" +check_otel_metric "rippled_Peer_Finder_Active_Inbound_Peers" +check_otel_metric "rippled_Peer_Finder_Active_Outbound_Peers" -# RPC counters (only if RPC was exercised — should be true from Steps 5-8) -check_statsd_metric "rippled_rpc_requests" +# RPC counters (Counter — Prometheus adds _total suffix automatically) +check_otel_metric "rippled_rpc_requests_total" # Overlay traffic -check_statsd_metric "rippled_total_Bytes_In" +check_otel_metric "rippled_total_Bytes_In" + +# Verify StatsD receiver 
is NOT required (no statsd receiver in pipeline) +log "" +log "--- Verify StatsD receiver is not required ---" +statsd_port_check=$(curl -sf "http://localhost:8125" 2>&1 || echo "refused") +if echo "$statsd_port_check" | grep -qi "refused\|error\|connection"; then + ok "StatsD port 8125 is not listening (not required)" +else + fail "StatsD port 8125 appears to be listening (should not be needed)" +fi # --------------------------------------------------------------------------- # Step 11: Summary diff --git a/docker/telemetry/otel-collector-config.yaml b/docker/telemetry/otel-collector-config.yaml index ff7734a234..3cd3e2b639 100644 --- a/docker/telemetry/otel-collector-config.yaml +++ b/docker/telemetry/otel-collector-config.yaml @@ -2,22 +2,21 @@ # # Pipelines: # traces: OTLP receiver -> batch processor -> debug + Jaeger + Tempo + spanmetrics -# metrics: spanmetrics connector + StatsD receiver -> Prometheus exporter +# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter # # rippled sends traces via OTLP/HTTP to port 4318. The collector batches # them, forwards to both Jaeger and Tempo, and derives RED metrics via the # spanmetrics connector, which Prometheus scrapes on port 8889. # -# rippled also sends beast::insight metrics via StatsD/UDP to port 8125. -# These are ingested by the statsd receiver and merged into the same -# Prometheus endpoint alongside span-derived metrics. +# rippled sends beast::insight metrics natively via OTLP/HTTP to port 4318 +# (same endpoint as traces). The OTLP receiver feeds both the traces and +# metrics pipelines. Metrics are exported to Prometheus alongside +# span-derived metrics. # -# TODO: The Resource Manager's "warn" and "drop" metrics use the non-standard -# "|m" (meter) StatsD type in StatsDCollector.cpp:706. The OTel StatsD -# receiver silently drops "|m" metrics since it only recognizes standard -# types (|c, |g, |ms, |h, |s). 
To capture these two metrics, change "|m" -# to "|c" in StatsDCollector.cpp — this is a breaking change for any -# backend that relied on the custom "|m" type. Tracked as Phase 6 Task 6.1. +# For backward compatibility, the StatsD receiver config is preserved below +# but commented out. If you need StatsD fallback (server=statsd in +# [insight]), uncomment the statsd receiver and add it to the metrics +# pipeline receivers list. receivers: otlp: @@ -26,20 +25,22 @@ receivers: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 - statsd: - endpoint: "0.0.0.0:8125" - aggregation_interval: 15s - enable_metric_type: true - is_monotonic_counter: true - timer_histogram_mapping: - - statsd_type: "timing" - observer_type: "summary" - summary: - percentiles: [0, 50, 90, 95, 99, 100] - - statsd_type: "histogram" - observer_type: "summary" - summary: - percentiles: [0, 50, 90, 95, 99, 100] + # StatsD receiver — kept for backward compatibility with server=statsd. + # Uncomment and add "statsd" to metrics pipeline receivers if needed. 
+ # statsd: + # endpoint: "0.0.0.0:8125" + # aggregation_interval: 15s + # enable_metric_type: true + # is_monotonic_counter: true + # timer_histogram_mapping: + # - statsd_type: "timing" + # observer_type: "summary" + # summary: + # percentiles: [0, 50, 90, 95, 99, 100] + # - statsd_type: "histogram" + # observer_type: "summary" + # summary: + # percentiles: [0, 50, 90, 95, 99, 100] processors: batch: @@ -84,5 +85,6 @@ service: processors: [batch] exporters: [debug, otlp/jaeger, otlp/tempo, spanmetrics] metrics: - receivers: [spanmetrics, statsd] + receivers: [otlp, spanmetrics] + processors: [batch] exporters: [prometheus] diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index d1f3b892e9..31d2a717d3 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -161,14 +161,27 @@ Configured in `otel-collector-config.yaml`: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s ``` -## StatsD Metrics (beast::insight) +## System Metrics (beast::insight via OTel native) -rippled has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans. +rippled has a built-in metrics framework (`beast::insight`) that exports metrics natively via OTLP/HTTP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans. ### Configuration Add to `xrpld.cfg`: +```ini +[insight] +server=otel +endpoint=http://localhost:4318/v1/metrics +prefix=rippled +``` + +The OTel Collector receives these via the OTLP receiver (same endpoint as traces, port 4318) and exports them to Prometheus alongside spanmetrics. 
+ +#### StatsD fallback (backward compatibility) + +The legacy StatsD backend is still available: + ```ini [insight] server=statsd @@ -176,7 +189,7 @@ address=127.0.0.1:8125 prefix=rippled ``` -The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics. +When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yaml` and add port `8125:8125/udp` to the docker-compose otel-collector service. ### Metric Reference @@ -284,7 +297,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section. | Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` | | Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` | -### Node Health — StatsD (`rippled-statsd-node-health`) +### Node Health — System Metrics (`rippled-system-node-health`) | Panel | Type | PromQL | Labels Used | | -------------------------- | ---------- | ------------------------------------------------------ | ----------- | @@ -297,7 +310,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section. | Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — | | Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — | -### Network Traffic — StatsD (`rippled-statsd-network`) +### Network Traffic — System Metrics (`rippled-system-network`) | Panel | Type | PromQL | Labels Used | | ---------------------- | ---------- | -------------------------------------- | ----------- | @@ -310,7 +323,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section. 
| Validation Traffic | timeseries | `rippled_validations_Messages_In/Out` | — | | Traffic by Category | bargauge | `topk(10, rippled_*_Bytes_In)` | — | -### RPC & Pathfinding — StatsD (`rippled-statsd-rpc`) +### RPC & Pathfinding — System Metrics (`rippled-system-rpc`) | Panel | Type | PromQL | Labels Used | | ------------------------- | ---------- | -------------------------------------------------------- | ----------- | @@ -354,6 +367,14 @@ Requires `trace_peer=1` in the `[telemetry]` config section. 3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces` 4. Check collector logs: `docker compose logs otel-collector` +### No system metrics in Prometheus + +1. Check rippled logs for `OTelCollector starting` message +2. Verify `server=otel` in the `[insight]` config section +3. Verify the endpoint in `[insight]` points to the OTLP/HTTP port (default: `http://localhost:4318/v1/metrics`) +4. Check that the `otlp` receiver is in the metrics pipeline receivers in `otel-collector-config.yaml` +5. Query Prometheus directly: `curl 'http://localhost:9090/api/v1/query?query=rippled_job_count'` + ### High memory usage - Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling) diff --git a/include/xrpl/beast/insight/Insight.h b/include/xrpl/beast/insight/Insight.h index bf3743cfd8..ee54111231 100644 --- a/include/xrpl/beast/insight/Insight.h +++ b/include/xrpl/beast/insight/Insight.h @@ -12,4 +12,5 @@ #include #include #include +#include #include diff --git a/include/xrpl/beast/insight/OTelCollector.h b/include/xrpl/beast/insight/OTelCollector.h new file mode 100644 index 0000000000..ee0dd2c1b0 --- /dev/null +++ b/include/xrpl/beast/insight/OTelCollector.h @@ -0,0 +1,92 @@ +#pragma once + +/** + * @file OTelCollector.h + * @brief OpenTelemetry-based implementation of the beast::insight::Collector + * interface for native OTLP metric export. 
+ * + * When XRPL_ENABLE_TELEMETRY is defined, OTelCollector maps each + * beast::insight instrument type (Counter, Gauge, Event, Meter, Hook) to + * the corresponding OpenTelemetry Metrics SDK instrument and exports + * them via OTLP/HTTP to an OpenTelemetry Collector. + * + * When XRPL_ENABLE_TELEMETRY is NOT defined, OTelCollector::New() returns + * a NullCollector so the binary compiles without OTel dependencies. + * + * Dependency diagram: + * + * +-----------------+ +-------------------+ + * | Collector (ABC) |<----| OTelCollector | + * +-----------------+ | (public header) | + * ^ +-------------------+ + * | | + * +-----------------+ +-------------------+ + * | NullCollector | | OTelCollectorImp | + * | (fallback when | | (impl in .cpp, | + * | no telemetry) | | uses OTel SDK) | + * +-----------------+ +-------------------+ + * | + * +-------------------+ + * | OTel Metrics SDK | + * | MeterProvider | + * | OTLP HTTP Metric | + * | Exporter | + * +-------------------+ + */ + +#include +#include + +#include +#include + +namespace beast { +namespace insight { + +/** + * @brief A Collector that exports metrics via OpenTelemetry OTLP/HTTP. + * + * Replaces StatsD-based metric collection with native OTel Metrics SDK + * instruments. Each beast::insight instrument maps to an OTel equivalent: + * + * - Counter -> OTel Counter + * - Gauge -> OTel ObservableGauge (async callback) + * - Event -> OTel Histogram (duration in milliseconds) + * - Meter -> OTel Counter (monotonic, unsigned) + * - Hook -> Called by PeriodicMetricReader at collection time + * + * @see StatsDCollector for the StatsD-based alternative. + * @see NullCollector for the no-op fallback. + */ +class OTelCollector : public Collector +{ +public: + explicit OTelCollector() = default; + + /** + * @brief Factory method to create an OTelCollector instance. + * + * When XRPL_ENABLE_TELEMETRY is defined, creates a real OTel-backed + * collector that exports metrics via OTLP/HTTP. 
When telemetry is + * disabled at compile time, returns a NullCollector. + * + * @param endpoint OTLP/HTTP metrics endpoint URL + * (e.g. "http://localhost:4318/v1/metrics"). + * @param prefix Prefix prepended to all metric names + * (e.g. "rippled"). + * @param instanceId Unique identifier for this node instance, + * emitted as the `service.instance.id` OTel + * resource attribute. Defaults to empty string + * (attribute omitted when empty). + * @param journal Journal for logging. + * @return Shared pointer to the created Collector. + */ + static std::shared_ptr + New(std::string const& endpoint, + std::string const& prefix, + std::string const& instanceId, + Journal journal); +}; + +} // namespace insight +} // namespace beast diff --git a/src/libxrpl/beast/insight/OTelCollector.cpp b/src/libxrpl/beast/insight/OTelCollector.cpp new file mode 100644 index 0000000000..b4c684510b --- /dev/null +++ b/src/libxrpl/beast/insight/OTelCollector.cpp @@ -0,0 +1,879 @@ +/** + * @file OTelCollector.cpp + * @brief OpenTelemetry Metrics SDK implementation of beast::insight::Collector. + * + * Compiled only when XRPL_ENABLE_TELEMETRY is defined (via CMake + * telemetry=ON). Maps beast::insight instruments to OTel SDK instruments + * and exports them via OTLP/HTTP using a PeriodicMetricReader. + * + * When XRPL_ENABLE_TELEMETRY is not defined, OTelCollector::New() returns + * a NullCollector so the build succeeds without OTel dependencies. 
+ * + * Data flow: + * + * beast::insight callers + * | + * v + * OTelCounterImpl / OTelGaugeImpl / OTelEventImpl / OTelMeterImpl + * | | | | + * v v v v + * OTel Counter ObservableGauge Histogram Counter + * | | | | + * +--------------------+----------------+--------------+ + * | + * v + * PeriodicMetricReader (1s interval) + * | + * v + * OtlpHttpMetricExporter -> OTel Collector -> Prometheus + */ + +#ifdef XRPL_ENABLE_TELEMETRY + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +namespace beast { +namespace insight { + +namespace detail { + +namespace metrics_api = opentelemetry::metrics; +namespace metrics_sdk = opentelemetry::sdk::metrics; +namespace otlp_http = opentelemetry::exporter::otlp; +namespace resource = opentelemetry::sdk::resource; + +class OTelCollectorImp; + +//------------------------------------------------------------------------------ + +/** + * @brief OTel-backed implementation of beast::insight::HookImpl. + * + * Stores a handler function that is invoked during each periodic + * metric collection cycle. This mirrors the StatsDHookImpl pattern + * where hooks are called at each 1-second timer tick, but here the + * invocation is triggered by the OTel PeriodicMetricReader's + * observable callback mechanism. + */ +class OTelHookImpl : public HookImpl +{ +public: + /** + * @param handler Callback invoked at each collection interval. + * @param impl Owning collector (prevents premature destruction). + */ + OTelHookImpl(HandlerType const& handler, std::shared_ptr const& impl); + + ~OTelHookImpl() override; + + /** + * @brief Invoke the stored handler. + * + * Called by the collector during observable gauge callbacks to give + * metric producers a chance to update gauge values before export. 
+ */ + void + callHandler(); + +private: + OTelHookImpl& + operator=(OTelHookImpl const&); + + /** Owning collector. Prevents collector destruction while hook alive. */ + std::shared_ptr m_impl; + + /** User-supplied handler called at each collection interval. */ + HandlerType m_handler; +}; + +//------------------------------------------------------------------------------ + +/** + * @brief OTel-backed implementation of beast::insight::CounterImpl. + * + * Wraps an OTel Counter instrument. Each increment() call + * is forwarded directly to the OTel counter's Add() method. The + * PeriodicMetricReader collects and exports the accumulated delta. + * + * Thread safety: OTel Counter::Add() is thread-safe by specification. + */ +class OTelCounterImpl : public CounterImpl +{ +public: + /** + * @param name Fully-qualified metric name (prefix.group.name). + * @param meter OTel Meter used to create the counter instrument. + */ + OTelCounterImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter); + + ~OTelCounterImpl() override = default; + + /** + * @brief Add amount to the counter. + * @param amount Value to add (must be non-negative for OTel counters). + */ + void + increment(value_type amount) override; + +private: + OTelCounterImpl& + operator=(OTelCounterImpl const&); + + /** OTel synchronous counter instrument. */ + opentelemetry::nostd::unique_ptr> m_counter; +}; + +//------------------------------------------------------------------------------ + +/** + * @brief OTel-backed implementation of beast::insight::EventImpl. + * + * Wraps an OTel Histogram instrument. Each notify() call + * records the duration in milliseconds. Uses explicit bucket boundaries + * matching the SpanMetrics connector configuration: + * [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms + * + * Thread safety: OTel Histogram::Record() is thread-safe by specification. 
+ */ +class OTelEventImpl : public EventImpl +{ +public: + /** + * @param name Fully-qualified metric name (prefix.group.name). + * @param meter OTel Meter used to create the histogram instrument. + */ + OTelEventImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter); + + ~OTelEventImpl() override = default; + + /** + * @brief Record a duration measurement. + * @param value Duration in milliseconds. + */ + void + notify(value_type const& value) override; + +private: + OTelEventImpl& + operator=(OTelEventImpl const&); + + /** OTel histogram instrument for recording durations. */ + opentelemetry::nostd::unique_ptr> m_histogram; +}; + +//------------------------------------------------------------------------------ + +/** + * @brief OTel-backed implementation of beast::insight::GaugeImpl. + * + * Uses an atomic int64_t to store the current gauge value. The OTel SDK + * reads this value via an ObservableGauge async callback during each + * collection cycle. The set() and increment() methods update the + * atomic value without blocking the collection thread. + * + * Design note: OTel gauges are asynchronous (observable) instruments. + * The SDK calls a registered callback to read the value rather than + * accepting push-style updates. We bridge the beast::insight push-style + * API to OTel's pull-style API via the atomic variable. + * + * Thread safety: std::atomic operations are lock-free on all platforms. + */ +class OTelGaugeImpl : public GaugeImpl +{ +public: + /** + * @param name Fully-qualified metric name (prefix.group.name). + * @param meter OTel Meter used to create the observable gauge. + * @param collector Owning collector, used to invoke hooks before reads. + */ + OTelGaugeImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter, + std::shared_ptr const& collector); + + ~OTelGaugeImpl() override; + + /** + * @brief Set the gauge to an absolute value. + * @param value New gauge value. 
+ */ + void + set(value_type value) override; + + /** + * @brief Increment (or decrement) the gauge by a signed amount. + * + * Clamps the result to [0, UINT64_MAX] to match StatsDGaugeImpl + * behavior. + * + * @param amount Signed amount to add to the current value. + */ + void + increment(difference_type amount) override; + + /** + * @brief Return the current gauge value for the OTel callback. + * @return The most recently set/incremented value. + */ + int64_t + currentValue() const; + +private: + OTelGaugeImpl& + operator=(OTelGaugeImpl const&); + + /** Current gauge value, updated atomically by set()/increment(). */ + std::atomic m_value{0}; + + /** OTel observable gauge handle (prevents deregistration). */ + opentelemetry::nostd::shared_ptr m_gauge; + + /** Owning collector, used to invoke hooks before reading gauge values. */ + std::shared_ptr m_collector; +}; + +//------------------------------------------------------------------------------ + +/** + * @brief OTel-backed implementation of beast::insight::MeterImpl. + * + * Wraps an OTel Counter instrument. Semantically identical + * to Counter but uses unsigned values. The OTel SDK accumulates deltas + * and exports them via the PeriodicMetricReader. + * + * Note: In StatsD, Meter used the non-standard "|m" type which was + * silently dropped by the OTel StatsD receiver. With native OTel, + * Meter values are properly captured as counter deltas. + * + * Thread safety: OTel Counter::Add() is thread-safe by specification. + */ +class OTelMeterImpl : public MeterImpl +{ +public: + /** + * @param name Fully-qualified metric name (prefix.group.name). + * @param meter OTel Meter used to create the counter instrument. + */ + OTelMeterImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter); + + ~OTelMeterImpl() override = default; + + /** + * @brief Add amount to the meter. + * @param amount Value to add (unsigned). 
+ */ + void + increment(value_type amount) override; + +private: + OTelMeterImpl& + operator=(OTelMeterImpl const&); + + /** OTel synchronous counter instrument (unsigned). */ + opentelemetry::nostd::unique_ptr> m_counter; +}; + +//------------------------------------------------------------------------------ + +/** + * @brief Main OTel Collector implementation. + * + * Creates an OTel MeterProvider with a PeriodicMetricReader that + * exports metrics via OTLP/HTTP at 1-second intervals. Implements + * all Collector::make_*() factory methods to create OTel-backed + * instrument wrappers. + * + * Class diagram: + * + * +------------------+ +------------------+ + * | Collector (ABC) |<-----| OTelCollector | + * +------------------+ | (public header) | + * ^ +------------------+ + * | ^ + * +------------------+ | + * | OTelCollectorImp |-------------+ + * +------------------+ + * | - m_journal | + * | - m_prefix | + * | - m_provider | +---------------------+ + * | - m_otelMeter |---->| OTel MeterProvider | + * | - m_hooks[] | | + PeriodicReader | + * | - m_gauges[] | | + OtlpHttpExporter | + * +------------------+ +---------------------+ + * + * Lifecycle: + * 1. Constructor creates MeterProvider + exporter pipeline. + * 2. make_*() methods create instruments registered with the provider. + * 3. PeriodicMetricReader collects every 1s, calling observable callbacks. + * 4. Observable callbacks invoke hooks, read gauge atomics. + * 5. Destructor shuts down MeterProvider (flushes pending exports). + * + * Caveats: + * - Observable gauge callbacks run on the SDK's internal thread. Hook + * handlers must be thread-safe. + * - Metric names are formed as "prefix_name" with dots replaced by + * underscores to match StatsD->Prometheus naming conventions. + * - The OTel Prometheus exporter appends "_total" to counters. The + * metric names we register do NOT include this suffix — Prometheus + * adds it automatically. 
+ * + * Example usage: + * @code + * auto collector = OTelCollector::New( + * "http://localhost:4318/v1/metrics", "rippled", "" /* instanceId */, journal); + * auto counter = collector->make_counter("rpc.requests"); + * counter.increment(1); + * // Metric "rippled_rpc_requests" exported via OTLP every 1s. + * @endcode + */ +class OTelCollectorImp : public OTelCollector, public std::enable_shared_from_this +{ +public: + /** + * @brief Construct the OTel collector and initialize the export pipeline. + * + * @param endpoint OTLP/HTTP metrics endpoint URL. + * @param prefix Prefix for all metric names. + * @param instanceId Value for the service.instance.id resource attribute. + * When empty, the attribute is omitted. + * @param journal Journal for logging. + */ + OTelCollectorImp( + std::string const& endpoint, + std::string const& prefix, + std::string const& instanceId, + Journal journal); + + /** + * @brief Shut down the MeterProvider, flushing any pending exports. + */ + ~OTelCollectorImp() override; + + /** @name Collector interface implementation */ + /** @{ */ + Hook + make_hook(HookImpl::HandlerType const& handler) override; + + Counter + make_counter(std::string const& name) override; + + Event + make_event(std::string const& name) override; + + Gauge + make_gauge(std::string const& name) override; + + Meter + make_meter(std::string const& name) override; + /** @} */ + + /** @name Hook management for observable callbacks */ + /** @{ */ + + /** + * @brief Register a hook for periodic invocation. + * @param hook Pointer to the hook to register. + */ + void + addHook(OTelHookImpl* hook); + + /** + * @brief Unregister a hook. + * @param hook Pointer to the hook to unregister. + */ + void + removeHook(OTelHookImpl* hook); + + /** + * @brief Invoke all registered hooks. + * + * Called from observable gauge callbacks before reading gauge values, + * so that hook handlers have a chance to update metrics. 
+ */ + void + callHooks(); + /** @} */ + + /** @name Gauge registration for observable callbacks */ + /** @{ */ + + /** + * @brief Register a gauge for observable callback reading. + * @param gauge Pointer to the gauge to register. + */ + void + addGauge(OTelGaugeImpl* gauge); + + /** + * @brief Unregister a gauge. + * @param gauge Pointer to the gauge to unregister. + */ + void + removeGauge(OTelGaugeImpl* gauge); + /** @} */ + + /** + * @brief Get the OTel Meter instance for creating instruments. + * @return Shared pointer to the OTel Meter. + */ + opentelemetry::nostd::shared_ptr const& + otelMeter() const; + + /** + * @brief Format a metric name with the configured prefix. + * + * Replaces dots with underscores to match StatsD->Prometheus naming. + * Example: prefix="rippled", name="LedgerMaster.Validated_Ledger_Age" + * -> "rippled_LedgerMaster_Validated_Ledger_Age" + * + * @param name Raw metric name from beast::insight callers. + * @return Fully-qualified metric name. + */ + std::string + formatName(std::string const& name) const; + +private: + /** Journal for log output. */ + Journal m_journal; + + /** Prefix for all metric names (e.g., "rippled"). */ + std::string m_prefix; + + /** OTel SDK MeterProvider owning the export pipeline. RAII lifecycle. */ + std::shared_ptr m_provider; + + /** OTel Meter used to create all instruments. */ + opentelemetry::nostd::shared_ptr m_otelMeter; + + /** Mutex protecting hook and gauge registration lists. */ + std::mutex m_mutex; + + /** Registered hooks called during observable callbacks. */ + std::vector m_hooks; + + /** Registered gauges read during observable callbacks. */ + std::vector m_gauges; + + /** + * @brief Debounce timestamp for callHooks(). + * + * Multiple gauge callbacks fire during the same collection cycle. + * This atomic tracks the last time hooks were invoked (ms since epoch). 
+ * Hooks are called at most once per 500ms window to avoid redundant + * invocations while still ensuring fresh values each collection cycle. + */ + std::atomic m_lastHookCallMs{0}; +}; + +//============================================================================== +// Implementation +//============================================================================== + +//------------------------------------------------------------------------------ +// OTelHookImpl +//------------------------------------------------------------------------------ + +OTelHookImpl::OTelHookImpl( + HandlerType const& handler, + std::shared_ptr const& impl) + : m_impl(impl), m_handler(handler) +{ + m_impl->addHook(this); +} + +OTelHookImpl::~OTelHookImpl() +{ + m_impl->removeHook(this); +} + +void +OTelHookImpl::callHandler() +{ + m_handler(); +} + +//------------------------------------------------------------------------------ +// OTelCounterImpl +//------------------------------------------------------------------------------ + +OTelCounterImpl::OTelCounterImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter) + : m_counter(meter->CreateUInt64Counter(name)) +{ +} + +void +OTelCounterImpl::increment(value_type amount) +{ + // OTel counters require non-negative values. beast::insight CounterImpl + // uses int64_t, so clamp negative values to 0 and cast to uint64_t. 
+ if (amount > 0) + m_counter->Add(static_cast(amount)); +} + +//------------------------------------------------------------------------------ +// OTelEventImpl +//------------------------------------------------------------------------------ + +OTelEventImpl::OTelEventImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter) + : m_histogram(meter->CreateDoubleHistogram(name, "Duration in ms", "ms")) +{ +} + +void +OTelEventImpl::notify(value_type const& value) +{ + m_histogram->Record(static_cast(value.count()), opentelemetry::context::Context{}); +} + +//------------------------------------------------------------------------------ +// OTelGaugeImpl +//------------------------------------------------------------------------------ + +OTelGaugeImpl::OTelGaugeImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter, + std::shared_ptr const& collector) + : m_gauge(meter->CreateInt64ObservableGauge(name)), m_collector(collector) +{ + m_collector->addGauge(this); + + // Register the async callback that the SDK calls during collection. + // Before reading the gauge value, invoke all registered hooks so that + // hook handlers (e.g. NetworkOPs State_Accounting) have a chance to + // update gauge values. callHooks() uses a debounce timestamp so hooks + // run at most once per collection cycle even with many gauges. + m_gauge->AddCallback( + [](opentelemetry::metrics::ObserverResult result, void* state) { + auto* self = static_cast(state); + self->m_collector->callHooks(); + if (auto intResult = opentelemetry::nostd::get_if>>(&result)) + { + (*intResult)->Observe(self->currentValue()); + } + }, + this); +} + +OTelGaugeImpl::~OTelGaugeImpl() +{ + m_collector->removeGauge(this); +} + +void +OTelGaugeImpl::set(value_type value) +{ + m_value.store(static_cast(value), std::memory_order_relaxed); +} + +void +OTelGaugeImpl::increment(difference_type amount) +{ + // Use compare-exchange loop to safely clamp to [0, MAX]. 
+ int64_t current = m_value.load(std::memory_order_relaxed); + int64_t desired; + do + { + desired = current + amount; + // Clamp to 0 on underflow. + if (desired < 0) + desired = 0; + } while (!m_value.compare_exchange_weak(current, desired, std::memory_order_relaxed)); +} + +int64_t +OTelGaugeImpl::currentValue() const +{ + return m_value.load(std::memory_order_relaxed); +} + +//------------------------------------------------------------------------------ +// OTelMeterImpl +//------------------------------------------------------------------------------ + +OTelMeterImpl::OTelMeterImpl( + std::string const& name, + opentelemetry::nostd::shared_ptr const& meter) + : m_counter(meter->CreateUInt64Counter(name)) +{ +} + +void +OTelMeterImpl::increment(value_type amount) +{ + m_counter->Add(amount); +} + +//------------------------------------------------------------------------------ +// OTelCollectorImp +//------------------------------------------------------------------------------ + +OTelCollectorImp::OTelCollectorImp( + std::string const& endpoint, + std::string const& prefix, + std::string const& instanceId, + Journal journal) + : m_journal(journal), m_prefix(prefix) +{ + if (m_journal.info()) + m_journal.info() << "OTelCollector starting: endpoint=" << endpoint + << " prefix=" << m_prefix; + + // Configure OTLP HTTP metric exporter. + otlp_http::OtlpHttpMetricExporterOptions exporterOpts; + exporterOpts.url = endpoint; + + auto exporter = otlp_http::OtlpHttpMetricExporterFactory::Create(exporterOpts); + + // Configure periodic metric reader (1-second export interval). + metrics_sdk::PeriodicExportingMetricReaderOptions readerOpts; + readerOpts.export_interval_millis = std::chrono::milliseconds(1000); + readerOpts.export_timeout_millis = std::chrono::milliseconds(500); + + auto reader = + metrics_sdk::PeriodicExportingMetricReaderFactory::Create(std::move(exporter), readerOpts); + + // Configure resource attributes matching the trace exporter. 
+ // Include service.instance.id when provided so Prometheus + // exported_instance labels distinguish multi-node deployments. + resource::ResourceAttributes attrs; + attrs[resource::SemanticConventions::kServiceName] = "rippled"; + if (!instanceId.empty()) + attrs[resource::SemanticConventions::kServiceInstanceId] = instanceId; + auto resourceAttrs = resource::Resource::Create(attrs); + + // Create MeterProvider with resource, then attach the metric reader. + m_provider = metrics_sdk::MeterProviderFactory::Create( + std::make_unique(), resourceAttrs); + m_provider->AddMetricReader(std::move(reader)); + + // Configure histogram bucket boundaries for Event instruments. + // These match the SpanMetrics connector buckets for consistency. + auto histogramSelector = metrics_sdk::InstrumentSelectorFactory::Create( + metrics_sdk::InstrumentType::kHistogram, "*", "ms"); + auto meterSelector = metrics_sdk::MeterSelectorFactory::Create("rippled_metrics", "", ""); + auto histogramConfig = std::make_shared(); + histogramConfig->boundaries_ = + std::vector{1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0}; + auto histogramView = metrics_sdk::ViewFactory::Create( + "default_histogram", + "Default histogram view with SpanMetrics-compatible buckets", + "ms", + metrics_sdk::AggregationType::kHistogram, + std::move(histogramConfig)); + + m_provider->AddView( + std::move(histogramSelector), std::move(meterSelector), std::move(histogramView)); + + // Create the OTel Meter for creating instruments. + m_otelMeter = m_provider->GetMeter("rippled_metrics", "1.0.0"); + + if (m_journal.info()) + m_journal.info() << "OTelCollector started successfully"; +} + +OTelCollectorImp::~OTelCollectorImp() +{ + if (m_journal.info()) + m_journal.info() << "OTelCollector shutting down"; + if (m_provider) + { + // ForceFlush to export any pending metrics before shutdown. 
+ m_provider->ForceFlush(); + m_provider->Shutdown(); + } + if (m_journal.info()) + m_journal.info() << "OTelCollector stopped"; +} + +Hook +OTelCollectorImp::make_hook(HookImpl::HandlerType const& handler) +{ + return Hook(std::make_shared(handler, shared_from_this())); +} + +Counter +OTelCollectorImp::make_counter(std::string const& name) +{ + return Counter(std::make_shared(formatName(name), m_otelMeter)); +} + +Event +OTelCollectorImp::make_event(std::string const& name) +{ + return Event(std::make_shared(formatName(name), m_otelMeter)); +} + +Gauge +OTelCollectorImp::make_gauge(std::string const& name) +{ + return Gauge( + std::make_shared(formatName(name), m_otelMeter, shared_from_this())); +} + +Meter +OTelCollectorImp::make_meter(std::string const& name) +{ + return Meter(std::make_shared(formatName(name), m_otelMeter)); +} + +void +OTelCollectorImp::addHook(OTelHookImpl* hook) +{ + std::lock_guard lock(m_mutex); + m_hooks.push_back(hook); +} + +void +OTelCollectorImp::removeHook(OTelHookImpl* hook) +{ + std::lock_guard lock(m_mutex); + m_hooks.erase(std::remove(m_hooks.begin(), m_hooks.end(), hook), m_hooks.end()); +} + +void +OTelCollectorImp::callHooks() +{ + // Debounce: hooks run at most once per 500ms. Multiple gauge callbacks + // fire during the same collection cycle — only the first one triggers + // hooks. Subsequent callbacks within the window read already-updated + // gauge values. + auto now = std::chrono::duration_cast( + std::chrono::steady_clock::now().time_since_epoch()) + .count(); + auto last = m_lastHookCallMs.load(std::memory_order_relaxed); + if (now - last < 500) + return; + if (!m_lastHookCallMs.compare_exchange_strong(last, now, std::memory_order_relaxed)) + return; // Another thread won the race. 
+ + std::lock_guard lock(m_mutex); + for (auto* hook : m_hooks) + hook->callHandler(); +} + +void +OTelCollectorImp::addGauge(OTelGaugeImpl* gauge) +{ + std::lock_guard lock(m_mutex); + m_gauges.push_back(gauge); +} + +void +OTelCollectorImp::removeGauge(OTelGaugeImpl* gauge) +{ + std::lock_guard lock(m_mutex); + m_gauges.erase(std::remove(m_gauges.begin(), m_gauges.end(), gauge), m_gauges.end()); +} + +opentelemetry::nostd::shared_ptr const& +OTelCollectorImp::otelMeter() const +{ + return m_otelMeter; +} + +std::string +OTelCollectorImp::formatName(std::string const& name) const +{ + // StatsD uses "prefix.group.name" format. The OTel StatsD receiver + // converts dots to underscores for Prometheus. We replicate this + // to preserve metric name compatibility. + // + // Example: prefix="rippled", name="LedgerMaster.Validated_Ledger_Age" + // -> "rippled_LedgerMaster_Validated_Ledger_Age" + std::string result; + if (!m_prefix.empty()) + { + result = m_prefix; + result += '_'; + } + for (char c : name) + { + result += (c == '.') ? '_' : c; + } + return result; +} + +} // namespace detail + +//------------------------------------------------------------------------------ + +std::shared_ptr +OTelCollector::New( + std::string const& endpoint, + std::string const& prefix, + std::string const& instanceId, + Journal journal) +{ + return std::make_shared(endpoint, prefix, instanceId, journal); +} + +} // namespace insight +} // namespace beast + +#else // !XRPL_ENABLE_TELEMETRY + +// When telemetry is disabled at compile time, OTelCollector::New() +// returns a NullCollector so callers do not need conditional logic. 
+ +#include +#include + +namespace beast { +namespace insight { + +std::shared_ptr +OTelCollector::New( + std::string const& /* endpoint */, + std::string const& /* prefix */, + std::string const& /* instanceId */, + Journal /* journal */) +{ + return NullCollector::New(); +} + +} // namespace insight +} // namespace beast + +#endif // XRPL_ENABLE_TELEMETRY diff --git a/src/xrpld/app/main/CollectorManager.cpp b/src/xrpld/app/main/CollectorManager.cpp index 353a49de91..0844019846 100644 --- a/src/xrpld/app/main/CollectorManager.cpp +++ b/src/xrpld/app/main/CollectorManager.cpp @@ -23,6 +23,24 @@ public: m_collector = beast::insight::StatsDCollector::New(address, prefix, journal); } + // LCOV_EXCL_START -- OTel collector path is not exercised in unit tests + else if (server == "otel") + { + // Read OTLP metrics endpoint from [insight] section. + // Default to the standard OTLP/HTTP metrics path on localhost. + std::string endpoint = get(params, "endpoint"); + if (endpoint.empty()) + endpoint = "http://localhost:4318/v1/metrics"; + std::string const& prefix(get(params, "prefix")); + + // Read service_instance_id, same key as the [telemetry] + // section uses, so multi-node deployments can distinguish + // metric sources via the exported_instance Prometheus label. + std::string const instanceId = get(params, "service_instance_id"); + + m_collector = beast::insight::OTelCollector::New(endpoint, prefix, instanceId, journal); + } + // LCOV_EXCL_STOP else { m_collector = beast::insight::NullCollector::New();
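Reviewer sketch (not part of the patch): two pieces of logic in `OTelCollector.cpp` can be exercised without the OTel SDK, namely the prefix-and-underscore name formatting in `formatName()` and the compare-exchange clamp in `OTelGaugeImpl::increment()`. The standalone version below is a minimal illustration under stated assumptions: `formatMetricName` and `ClampedGauge` are hypothetical names introduced here and do not exist in the diff, and the code mirrors only the behavior visible in the patch.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical free-function version of OTelCollectorImp::formatName():
// prepend the prefix (when non-empty) followed by '_', then copy the
// name with every '.' replaced by '_'.
std::string
formatMetricName(std::string const& prefix, std::string const& name)
{
    std::string result;
    if (!prefix.empty())
    {
        result = prefix;
        result += '_';
    }
    for (char c : name)
        result += (c == '.') ? '_' : c;
    return result;
}

// Hypothetical standalone gauge mirroring OTelGaugeImpl::increment():
// a compare-exchange loop that clamps the stored value at zero.
class ClampedGauge
{
public:
    void
    increment(std::int64_t amount)
    {
        std::int64_t current = value_.load(std::memory_order_relaxed);
        std::int64_t desired;
        do
        {
            desired = current + amount;
            // Clamp to 0 on underflow, as the patch does.
            if (desired < 0)
                desired = 0;
            // On failure, compare_exchange_weak reloads `current`,
            // so the next iteration recomputes `desired`.
        } while (!value_.compare_exchange_weak(
            current, desired, std::memory_order_relaxed));
    }

    std::int64_t
    value() const
    {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::int64_t> value_{0};
};
```

With prefix `rippled`, the name `LedgerMaster.Validated_Ledger_Age` formats to `rippled_LedgerMaster_Validated_Ledger_Age`, matching the example in the patch comments. Note that the loop only guards the lower bound: like the code in the diff, it does not protect against signed overflow when `current + amount` exceeds the `int64_t` range, so the documented `[0, UINT64_MAX]` clamp is not what the stored `int64_t` actually provides.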