mirror of
https://github.com/XRPLF/rippled.git
synced 2026-04-29 15:37:57 +00:00
Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -556,6 +556,8 @@ span->SetAttribute("peer.id", peerId);

### 2.6.4 Coexistence Strategy

> **Note**: Phase 7 replaces the StatsD bridge with native OTel Metrics SDK export. The diagram below shows the Phase 6 intermediate state. See [Phase7_taskList.md](./Phase7_taskList.md) for the migration design where Beast Insight emits via OTLP instead of StatsD.

```mermaid
flowchart TB
subgraph rippled["rippled Process"]
@@ -584,6 +586,8 @@ flowchart TB
- **OpenTelemetry to OTLP Collector**: OTel exports spans over OTLP/gRPC to a Collector, which then forwards to a trace backend (Tempo).
- **Grafana (red, unified UI)**: All three data streams converge in Grafana, enabling operators to correlate logs, metrics, and traces in a single dashboard.

**Phase 7 target state**: Beast Insight routes to `OTelCollector`, a new `Collector` implementation that exports via OTLP/HTTP to the same OTel Collector as traces. The StatsD UDP path becomes a deprecated fallback (`[insight] server=statsd`). See [06-implementation-phases.md §6.8](./06-implementation-phases.md) and [Phase7_taskList.md](./Phase7_taskList.md) for details.

### 2.6.5 Correlation with PerfLog

Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging:

@@ -921,18 +921,22 @@ jsonData:
filterBySpanID: false
```

### 5.8.7 Correlation with Insight/OTel System Metrics

To correlate traces with Beast Insight system metrics:

**Step 1: Export Insight metrics to Prometheus**

Beast Insight metrics are exported natively via OTLP to the OTel Collector,
which exposes them on the Prometheus endpoint alongside spanmetrics. No
separate StatsD exporter is needed when using `server=otel`.

```ini
# xrpld.cfg — native OTel metrics (recommended)
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```

**Step 2: Add exemplars to metrics**

@@ -355,7 +355,187 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi
- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping)

---

## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12)

**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline.

### Motivation: Why Migrate from StatsD to Native OTel Metrics

The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves.

#### What We Gain

1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."

2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP: no delivery guarantees, no backpressure, packets that must stay within 1472 bytes (a 1500-byte MTU minus IPv4 and UDP headers) to avoid fragmentation, and text-encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.

3. **Fixes the `|m` wire format issue** — `StatsDMeterImpl` emits the non-standard `|m` StatsD type, which the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely, resolving the deferred Phase 6 Task 6.1.

4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.

5. **Removes infrastructure dependency** — No StatsD receiver is needed in the OTel Collector: one fewer receiver to configure, monitor, and debug, and a simpler collector YAML.

6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (`service.name`, `service.instance.id`). Grafana can link from a metric spike directly to the traces that caused it, which is impossible with StatsD-sourced metrics.

7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown, all built into the SDK rather than hand-rolled in `StatsDCollectorImp`.

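To make the `|m` issue in item 3 concrete, the sketch below formats a StatsD line (`<name>:<value>|<type>`) and checks its type field against the types that the Phase 6 notes list as recognized by the OTel StatsD receiver (`c`, `g`, `ms`, `h`, `s`). The helper names and the `rippled.warn` metric string are illustrative assumptions. They are not code taken from `StatsDCollector.cpp` or from the receiver.

```cpp
#include <array>
#include <string>
#include <string_view>

// Hypothetical helper (not from StatsDCollector.cpp): formats one StatsD
// line in the "<name>:<value>|<type>" shape.
std::string formatStatsDLine(std::string const& name, long long value,
                             std::string const& type)
{
    return name + ":" + std::to_string(value) + "|" + type;
}

// Hypothetical check: true if the line's type field is one of the types
// the Phase 6 notes list as recognized by the OTel StatsD receiver
// (c, g, ms, h, s). Anything else, such as the meter type "m", is rejected.
bool receiverRecognizes(std::string_view line)
{
    static constexpr std::array<std::string_view, 5> kKnownTypes{
        "c", "g", "ms", "h", "s"};

    auto const bar = line.find('|');
    if (bar == std::string_view::npos)
        return false;  // no type field at all

    // The type ends at the next '|' (e.g. a "|@0.5" sample rate) or at
    // the end of the line.
    std::string_view type = line.substr(bar + 1);
    auto const next = type.find('|');
    if (next != std::string_view::npos)
        type = type.substr(0, next);

    for (auto known : kKnownTypes)
        if (type == known)
            return true;
    return false;
}
```

A meter line such as `rippled.warn:1|m` fails this check. That is the failure mode that mapping meters onto OTel counters removes.
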
#### What We Lose

1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback.

2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is very simple and needs no connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but there are more moving parts.

3. **Slightly higher memory** — The OTel SDK maintains internal aggregation state for metrics before export, whereas StatsD just formats and sends strings. Expected overhead: roughly 1-2 MB of additional metric state.

4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it is less battle-tested than the tracing SDK in the C++ ecosystem.

#### Decision

The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period.

### Architecture

#### Class Hierarchy (after Phase 7)

```
beast::insight::Collector (abstract interface — unchanged)
  |
  +-- StatsDCollector (existing — retained as fallback, deprecated)
  |     +-- StatsDCounterImpl  -> StatsD |c over UDP
  |     +-- StatsDGaugeImpl    -> StatsD |g over UDP
  |     +-- StatsDMeterImpl    -> StatsD |m over UDP (non-standard)
  |     +-- StatsDEventImpl    -> StatsD |ms over UDP
  |     +-- StatsDHookImpl     -> 1s periodic callback
  |
  +-- NullCollector (existing — unchanged, used when disabled)
  |     +-- NullCounterImpl    -> no-op
  |     +-- NullGaugeImpl      -> no-op
  |     +-- NullMeterImpl      -> no-op
  |     +-- NullEventImpl      -> no-op
  |     +-- NullHookImpl       -> no-op
  |
  +-- OTelCollector (NEW — Phase 7)
        +-- OTelCounterImpl    -> otel::Counter<int64_t>
        +-- OTelGaugeImpl      -> otel::ObservableGauge<uint64_t>
        +-- OTelMeterImpl      -> otel::Counter<uint64_t>
        +-- OTelEventImpl      -> otel::Histogram<double>
        +-- OTelHookImpl       -> 1s periodic callback (same pattern)
```

#### Data Flow (after Phase 7)

```mermaid
graph LR
subgraph rippledNode["rippled Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN"]
B["beast::insight<br/>OTelCollector"]
end

subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
BP["Batch Processor"]
SM["SpanMetrics Connector"]

R1 --> BP
BP --> SM
end

subgraph backends["Trace Backends"]
D["Jaeger / Tempo"]
end

subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
end

subgraph viz["Visualization"]
F["Grafana :3000"]
end

A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1

BP -->|"OTLP/gRPC"| D
SM -->|"RED metrics"| E
R1 -->|"rippled_* metrics<br/>(native OTLP)"| E

E --> F
D --> F

style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#d9534f,color:#fff,stroke:#b52d2d
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```

**Key change**: The StatsD receiver is removed from the collector; traces and metrics both enter through the OTLP receiver on the same port.

#### Configuration

```ini
# [insight] section — new "otel" server option
[insight]
server=otel # NEW: uses OTel OTLP metrics exporter
prefix=rippled # metric name prefix (preserved)

# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```

The `OTelCollector` reads the OTLP endpoint from the `[telemetry]` section and replaces the `/v1/traces` path with `/v1/metrics` for the metrics exporter, so no additional config keys are needed.

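The endpoint rewrite described above is small enough to sketch. The helper below is hypothetical (the name `metricsEndpointFromTraces` does not come from the codebase). It assumes the `[telemetry]` endpoint ends exactly in `/v1/traces`; any other value is passed through unchanged rather than guessed at.

```cpp
#include <string>

// Hypothetical helper: derive the OTLP/HTTP metrics URL from the
// [telemetry] traces endpoint by swapping a trailing "/v1/traces" path
// for "/v1/metrics". Endpoints without that suffix are returned unchanged.
std::string metricsEndpointFromTraces(std::string endpoint)
{
    static std::string const kTraces = "/v1/traces";
    static std::string const kMetrics = "/v1/metrics";

    bool const endsWithTraces =
        endpoint.size() >= kTraces.size() &&
        endpoint.compare(endpoint.size() - kTraces.size(), kTraces.size(),
                         kTraces) == 0;
    if (endsWithTraces)
        endpoint.replace(endpoint.size() - kTraces.size(), kTraces.size(),
                         kMetrics);
    return endpoint;
}
```

Passing unrecognized endpoints through keeps a hand-configured metrics URL working. A stricter variant could reject such values or log a warning instead.
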
**Backward compatibility**: `server=statsd` continues to work exactly as before.

See [Phase7_taskList.md](./Phase7_taskList.md) for a detailed per-task breakdown.

### Instrument Type Mapping

| beast::insight | OTel Metrics SDK | Rationale |
| ---------------------- | -------------------------------- | ---------------------------------------------------------------- |
| Counter (int64, `\|c`) | `Counter<int64_t>` | Direct 1:1 mapping |
| Gauge (uint64, `\|g`) | `ObservableGauge<uint64_t>` | Async callback matches existing Hook polling pattern |
| Meter (uint64, `\|m`) | `Counter<uint64_t>` | Fixes non-standard wire format; meters are semantically counters |
| Event (ms, `\|ms`) | `Histogram<double>` | Duration distributions with explicit bucket boundaries |
| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval |

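The Gauge and Hook rows rely on pull-style collection: `set()` only records the latest value, and the reader samples it once per interval. The standard-library model below illustrates that flow without the OTel SDK. In opentelemetry-cpp, the equivalent is an observable instrument (for example, one created with `Meter::CreateInt64ObservableGauge`) plus a callback registered through `AddCallback`. All class names are hypothetical. Running hooks before sampling gauges is an assumption about ordering, not something the plan specifies.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for OTelGaugeImpl: set() stores the latest value
// atomically; the collection cycle only reads it.
class GaugeSketch
{
public:
    void set(std::uint64_t value) { value_.store(value, std::memory_order_relaxed); }
    std::uint64_t observe() const { return value_.load(std::memory_order_relaxed); }

private:
    std::atomic<std::uint64_t> value_{0};
};

// Hypothetical stand-in for one export cycle of a periodic reader:
// run every registered hook (so hooks can refresh gauges), then sample
// each gauge, the way an asynchronous instrument's callback is read.
class CollectionCycleSketch
{
public:
    std::shared_ptr<GaugeSketch> makeGauge(std::string name)
    {
        auto gauge = std::make_shared<GaugeSketch>();
        gauges_.emplace(std::move(name), gauge);
        return gauge;
    }

    void addHook(std::function<void()> hook) { hooks_.push_back(std::move(hook)); }

    // Called once per interval (1s in the plan).
    std::map<std::string, std::uint64_t> collect()
    {
        for (auto& hook : hooks_)
            hook();
        std::map<std::string, std::uint64_t> snapshot;
        for (auto const& [name, gauge] : gauges_)
            snapshot[name] = gauge->observe();
        return snapshot;
    }

private:
    std::map<std::string, std::shared_ptr<GaugeSketch>> gauges_;
    std::vector<std::function<void()>> hooks_;
};
```

Because `set()` never exports anything itself, a gauge updated many times between cycles costs only an atomic store each time, and the exporter sees just the latest value.
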
### Tasks

| Task | Description |
| ---- | ------------------------------------------------------------------------- |
| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) |
| 7.2 | Implement `OTelCollector` class (~400-500 lines) |
| 7.3 | Update `CollectorManager` — add `server=otel` |
| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) |
| 7.5 | Preserve metric names in Prometheus (naming strategy) |
| 7.6 | Update Grafana dashboards (if names change) |
| 7.7 | Update integration tests |
| 7.8 | Update documentation (runbook, reference docs) |

### Exit Criteria

- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- [ ] `server=otel` is the default in development docker-compose
- [ ] `server=statsd` still works as a fallback
- [ ] Existing Grafana dashboards display data correctly
- [ ] Integration test passes with OTLP-only metrics pipeline
- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead)
- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant

---

@@ -636,14 +816,15 @@ Clear, measurable criteria for each phase.

### 6.13.6 Success Metrics Summary

| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------------- | --------------------------- | -------------- |
| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 |
| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 |
| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 |
| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 |
| Phase 5 | Production deployment | Operators trained | End of Week 9 |
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 |
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 |

---

@@ -195,6 +195,7 @@ flowchart TB
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |

---

@@ -10,13 +10,12 @@
graph LR
subgraph rippledNode["rippled Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
end

subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP<br/>(traces + metrics)"]
BP["Batch Processor<br/>timeout 1s, batch 100"]
SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]

@@ -30,7 +29,7 @@ graph LR
end

subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + system metrics"]
end

subgraph viz["Visualization"]
@@ -38,22 +37,21 @@ graph LR
end

A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
B -->|"OTLP/HTTP :4318<br/>(gauges, counters, histograms)"| R1

BP -->|"OTLP/gRPC :4317"| D
BP -->|"OTLP/gRPC"| T

SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
R1 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* histograms"| E

E -->|"Prometheus<br/>data source"| F
D -->|"Jaeger<br/>data source"| F
T -->|"Tempo<br/>data source"| F

style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#4a90d9,color:#fff,stroke:#2a6db5
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
@@ -67,10 +65,10 @@ graph LR
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```

There are two independent telemetry pipelines entering a single **OTel Collector** via the same OTLP receiver:

1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
2. **beast::insight OTel Metrics** — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same **OTLP Receiver**. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; `server=statsd` remains available as a fallback.

**Trace backends** — The collector exports traces via OTLP/gRPC to one or both:

@@ -268,14 +266,26 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er

---

## 2. System Metrics (beast::insight — OTel native)

> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design, and [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6/7 metric inventory.
>
> **Migration complete**: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changed. `[insight] server=statsd` remains as a fallback.

These are system-level metrics emitted by rippled's `beast::insight` framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.

### Configuration

```ini
# Recommended: native OTel metrics via OTLP/HTTP
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```

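The Prometheus names listed in this section, such as `rippled_job_count`, are consistent with two steps. First, the `prefix` and the instrument name are joined with `.`. Second, every character outside the Prometheus metric-name alphabet (`[a-zA-Z_:][a-zA-Z0-9_:]*`) is replaced with `_`. The sketch below shows that mapping under those assumptions. Both helpers are hypothetical. The collector's Prometheus exporter may also append unit or `_total` suffixes, which this sketch does not model.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: join the [insight] prefix and an instrument name
// with '.', StatsD style (e.g. "rippled.job_count").
std::string qualifiedName(std::string const& prefix, std::string const& name)
{
    return prefix.empty() ? name : prefix + "." + name;
}

// Map a name onto the Prometheus metric-name character set by replacing
// every character other than [a-zA-Z0-9_:] with '_', and prepending '_'
// if the result would otherwise start with a digit.
std::string toPrometheusName(std::string const& name)
{
    std::string out;
    out.reserve(name.size() + 1);
    for (char ch : name)
    {
        auto const c = static_cast<unsigned char>(ch);
        out.push_back((std::isalnum(c) || ch == '_' || ch == ':') ? ch : '_');
    }
    if (!out.empty() && std::isdigit(static_cast<unsigned char>(out.front())))
        out.insert(out.begin(), '_');
    return out;
}
```
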
Fallback (StatsD):

```ini
[insight]
server=statsd
@@ -305,7 +315,7 @@ prefix=rippled
| `rippled_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) |
| `rippled_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |

**Grafana dashboard**: _Node Health (System Metrics)_ (`rippled-system-node-health`)

### 2.2 Counters

@@ -317,11 +327,11 @@ prefix=rippled
| `rippled_warn` | Logic.h | Resource manager warnings issued |
| `rippled_drop` | Logic.h | Resource manager drops (connections rejected) |

**Note**: With `server=otel`, `rippled_warn` and `rippled_drop` are properly exported as OTel Counter instruments. The previous StatsD `|m` type limitation no longer applies.

**Grafana dashboard**: _RPC & Pathfinding (System Metrics)_ (`rippled-system-rpc`)

### 2.3 Histograms (Event timers)

| Prometheus Metric | Source File | Unit | Description |
| ----------------------- | ----------------- | ----- | ------------------------------ |
@@ -361,7 +371,7 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
| `ping` / `status` | Keepalive and status |
| `set_get` | Set requests |

**Grafana dashboards**: _Network Traffic_ (`rippled-system-network`), _Overlay Traffic Detail_ (`rippled-system-overlay-detail`), _Ledger Data & Sync_ (`rippled-system-ledger-sync`)

---

@@ -379,15 +389,15 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
| Ledger Operations | `rippled-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `rippled-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |

### 3.2 System Metrics Dashboards (5)

| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ------------------------------- | ----------------- | --------------------------------------------------------------------------------- |
| Node Health | `rippled-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | `rippled-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | `rippled-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `rippled-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `rippled-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |

### 3.3 Accessing the Dashboards

@@ -443,7 +453,7 @@ ledger.store (persist to DB)

## 5. Prometheus Query Examples

> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus system metrics with trace-derived metrics.

### Span-Derived Metrics

@@ -187,17 +187,19 @@ OpenTelemetry Collector configurations are provided for development and producti

## 6. Implementation Phases

The implementation spans 12 weeks across 7 phases:

| Phase | Duration | Focus | Key Deliverables |
| ----- | ----------- | --------------------- | ----------------------------------------------------------- |
| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
| 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards |
| 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation |

**Total Effort**: 60.6 developer-days with 2 developers

➡️ **[View full Implementation Phases](./06-implementation-phases.md)**

254
OpenTelemetryPlan/Phase7_taskList.md
Normal file
@@ -0,0 +1,254 @@
# Phase 7: Native OTel Metrics Migration — Task List

> **Goal**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency.
>
> **Scope**: New `OTelCollectorImpl` class, `CollectorManager` config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.
>
> **Branch**: `pratik/otel-phase7-native-metrics` (from `pratik/otel-phase6-statsd`)

### Related Plan Documents

| Document | Relevance |
| -------------------------------------------------------------------- | --------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 7 plan: motivation, architecture, exit criteria (§6.8) |
| [02-design-decisions.md](./02-design-decisions.md) | Collector interface design, beast::insight coexistence strategy |
| [05-configuration-reference.md](./05-configuration-reference.md) | `[insight]` and `[telemetry]` config sections |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved |

---

## Task 7.1: Add OTel Metrics SDK to Build Dependencies

**Objective**: Enable the OTel C++ Metrics SDK components in the build system.

**What to do**:

- Edit `conanfile.py`:
  - Add OTel metrics SDK components to the dependency list when `telemetry=True`
  - Components needed: `opentelemetry-cpp::metrics`, `opentelemetry-cpp::otlp_http_metric_exporter`

- Edit `CMakeLists.txt` (telemetry section):
  - Link the `opentelemetry-cpp::metrics` and `opentelemetry-cpp::otlp_http_metric_exporter` targets

**Key modified files**:

- `conanfile.py`
- `CMakeLists.txt` (or the relevant telemetry cmake target)

**Reference**: [05-configuration-reference.md §5.3](./05-configuration-reference.md) — CMake integration

---

## Task 7.2: Implement OTelCollector Class

**Objective**: Create the core `OTelCollector` implementation that maps beast::insight instruments to OTel Metrics SDK instruments.

**What to do**:

- Create `include/xrpl/beast/insight/OTelCollector.h`:
  - Public factory: `static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)`
  - Derives from `StatsDCollector` (or directly from `Collector` — TBD based on shared code)

- Create `src/libxrpl/beast/insight/OTelCollector.cpp` (~400-500 lines):
  - **OTelCounterImpl**: Wraps `opentelemetry::metrics::Counter<int64_t>`. `increment(amount)` calls `counter->Add(amount)`.
  - **OTelGaugeImpl**: Uses `opentelemetry::metrics::ObservableGauge<uint64_t>` with an async callback. `set(value)` stores the value atomically; the callback reads it during collection.
  - **OTelMeterImpl**: Wraps `opentelemetry::metrics::Counter<uint64_t>`. `increment(amount)` calls `counter->Add(amount)`. Semantically identical to Counter, but unsigned.
  - **OTelEventImpl**: Wraps `opentelemetry::metrics::Histogram<double>`. `notify(duration)` calls `histogram->Record(duration.count())`. Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
  - **OTelHookImpl**: Stores the handler function, which is called during periodic metric collection (the same 1s cadence, driven by `PeriodicMetricReader`).
  - **OTelCollectorImp**: Main class.
    - Creates `MeterProvider` with `PeriodicMetricReader` (1s export interval)
    - Creates `OtlpHttpMetricExporter` pointing to `[telemetry]` endpoint
    - Sets resource attributes (service.name, service.instance.id) matching trace exporter
    - Implements all `make_*()` factory methods
    - Prefixes metric names with `[insight] prefix=` value

- Guard all OTel SDK includes with `#ifdef XRPL_ENABLE_TELEMETRY` so that the collector compiles to `NullCollector` equivalents when telemetry is disabled.

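The bucket boundaries listed for `OTelEventImpl` use explicit-bucket semantics. Each boundary is an inclusive upper bound, and one extra overflow bucket catches everything above the last boundary, so ten boundaries give eleven buckets. The sketch below shows which bucket a duration lands in. It is plain C++ for illustration only. In opentelemetry-cpp, the boundaries would typically be set through a metric View carrying a `HistogramAggregationConfig`, which this sketch does not use.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Bucket boundaries (ms) listed for OTelEventImpl in Task 7.2.
constexpr std::array<double, 10> kEventBucketsMs{
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0};

// Explicit-bucket semantics: bucket 0 holds values <= bounds[0]; bucket i
// holds values in (bounds[i-1], bounds[i]]; the final bucket (index
// bounds.size()) holds everything above the last boundary.
std::size_t bucketIndex(double valueMs)
{
    // lower_bound finds the first boundary >= value, i.e. the inclusive
    // upper bound of the bucket that contains it.
    auto const it = std::lower_bound(kEventBucketsMs.begin(),
                                     kEventBucketsMs.end(), valueMs);
    return static_cast<std::size_t>(it - kEventBucketsMs.begin());
}
```

For example, a 1.5 ms event falls in the `(1, 5]` bucket (index 1), while a 6 s event lands in the overflow bucket (index 10).
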
**Key new files**:

- `include/xrpl/beast/insight/OTelCollector.h`
- `src/libxrpl/beast/insight/OTelCollector.cpp`

**Key patterns to follow**:

- Match `StatsDCollector.cpp` structure: private impl classes, intrusive list for metrics, strand-based thread safety
- Match existing telemetry code style from `src/libxrpl/telemetry/Telemetry.cpp`
- Use RAII for MeterProvider lifecycle (shutdown on destructor)

**Reference**: [04-code-samples.md](./04-code-samples.md) — code style and patterns
|
||||
|
||||
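A minimal sketch of the RAII and `#ifdef` rules above, using simplified stand-ins. `Collector` and `NullCollector` here are trimmed-down placeholders (the real `beast::insight` interfaces have more methods), and `ProviderGuard`, `OTelBackedCollector`, and `makeOTelCollector` are hypothetical names. The guard runs a shutdown action exactly once when destroyed, standing in for shutting down the `MeterProvider`; without `XRPL_ENABLE_TELEMETRY` the factory falls back to a null collector.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Trimmed-down placeholder for beast::insight::Collector.
struct Collector
{
    virtual ~Collector() = default;
    virtual bool
    isNull() const = 0;
};

struct NullCollector : Collector
{
    bool
    isNull() const override
    {
        return true;
    }
};

// Hypothetical RAII owner: runs the shutdown action once, on destruction
// (stands in for shutting down the MeterProvider).
class ProviderGuard
{
public:
    explicit ProviderGuard(std::function<void()> shutdown)
        : shutdown_(std::move(shutdown))
    {
    }
    ~ProviderGuard()
    {
        if (shutdown_)
            shutdown_();
    }
    ProviderGuard(ProviderGuard const&) = delete;
    ProviderGuard&
    operator=(ProviderGuard const&) = delete;

private:
    std::function<void()> shutdown_;
};

#ifdef XRPL_ENABLE_TELEMETRY
// Only compiled when telemetry is enabled; the real class would include the
// OTel SDK headers and build the exporter, reader, and provider here.
struct OTelBackedCollector : Collector
{
    ProviderGuard guard{[] { /* provider shutdown would go here */ }};
    bool
    isNull() const override
    {
        return false;
    }
};
#endif

std::shared_ptr<Collector>
makeOTelCollector(std::string const& endpoint, std::string const& prefix)
{
    (void)endpoint;  // unused in this sketch
    (void)prefix;
#ifdef XRPL_ENABLE_TELEMETRY
    return std::make_shared<OTelBackedCollector>();
#else
    return std::make_shared<NullCollector>();
#endif
}
```

Keeping the SDK behind a single `#ifdef` in the factory means call sites never need their own telemetry guards.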
---

## Task 7.3: Update CollectorManager

**Objective**: Add `server=otel` config option to route metric creation to the new OTel backend.

**What to do**:

- Edit `src/xrpld/app/main/CollectorManager.cpp`:
  - In the constructor, add a third branch after `server == "statsd"`:
    ```cpp
    else if (server == "otel")
    {
        // Read endpoint from [telemetry] section.
        // `telemetryParams` is a placeholder for that Section, which the
        // constructor does not currently receive (see the note below).
        auto const endpoint = get(telemetryParams, "endpoint",
            "http://localhost:4318/v1/metrics");
        std::string const& prefix(get(params, "prefix"));
        m_collector = beast::insight::OTelCollector::New(
            endpoint, prefix, journal);
    }
    ```
  - This needs access to the `[telemetry]` config section: either pass it to the constructor as a parameter or read it from the Application config.

- Edit `src/xrpld/app/main/CollectorManager.h`:
  - Add `#include <xrpl/beast/insight/OTelCollector.h>`

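The routing rule is small enough to model and test in isolation. This is a hypothetical, self-contained sketch: `InsightBackend`, `InsightChoice`, `lookup`, and `chooseInsightBackend` are illustrative names, and plain `std::map`s stand in for the `[insight]` and `[telemetry]` config `Section`s. It assumes, as in the snippet above, that the endpoint comes from `[telemetry]` with the `http://localhost:4318/v1/metrics` default, and that a missing or unrecognized `server=` value falls back to the null collector.

```cpp
#include <map>
#include <string>

// Hypothetical model of the CollectorManager branch logic.
enum class InsightBackend { Null, StatsD, OTel };

struct InsightChoice
{
    InsightBackend backend;
    std::string endpoint;  // only set for the OTel backend
};

// Stand-in for a config Section (key=value pairs).
using Section = std::map<std::string, std::string>;

std::string
lookup(Section const& section, std::string const& key, std::string const& fallback)
{
    auto const it = section.find(key);
    return it == section.end() ? fallback : it->second;
}

InsightChoice
chooseInsightBackend(Section const& insight, Section const& telemetry)
{
    std::string const server = lookup(insight, "server", "");
    if (server == "statsd")
        return {InsightBackend::StatsD, ""};
    if (server == "otel")
        return {
            InsightBackend::OTel,
            lookup(telemetry, "endpoint", "http://localhost:4318/v1/metrics")};
    // Missing or unknown server= value: no metrics backend.
    return {InsightBackend::Null, ""};
}
```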
**Key modified files**:

- `src/xrpld/app/main/CollectorManager.cpp`
- `src/xrpld/app/main/CollectorManager.h`

---

## Task 7.4: Update OTel Collector Configuration

**Objective**: Add the OTLP receiver to the Collector's metrics pipeline and remove the dependency on the StatsD receiver.

**What to do**:

- Edit `docker/telemetry/otel-collector-config.yaml`:
  - Remove `statsd` receiver (no longer needed when `server=otel`)
  - Add metrics pipeline under `service.pipelines`:
    ```yaml
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]
    ```
  - The OTLP receiver already listens on :4318; it only needs to be added to the metrics pipeline's receivers.
  - Keep `spanmetrics` connector in the metrics pipeline so span-derived RED metrics continue working.

- Edit `docker/telemetry/docker-compose.yml`:
  - Remove UDP :8125 port mapping from otel-collector service
  - Update rippled service config: change `[insight] server=statsd` to `server=otel`

**Key modified files**:

- `docker/telemetry/otel-collector-config.yaml`
- `docker/telemetry/docker-compose.yml`

**Note**: Keep a commented-out `statsd` receiver block for operators who need backward compatibility.

---

## Task 7.5: Preserve Metric Names in Prometheus

**Objective**: Ensure existing Grafana dashboards continue working with identical metric names.

**What to do**:

- In `OTelCollector.cpp`, construct OTel instrument names to match existing Prometheus metric names:
  - beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `rippled_LedgerMaster_Validated_Ledger_Age`
  - The prefix + group + name concatenation must produce the same string that `StatsDCollector`'s format ends up as in Prometheus today
  - Use underscores as separators (StatsD itself uses dot-separated paths; the underscores are what those paths become in Prometheus)

- Verify in integration test that key Prometheus queries still return data:
  - `rippled_LedgerMaster_Validated_Ledger_Age`
  - `rippled_Peer_Finder_Active_Inbound_Peers`
  - `rippled_rpc_requests`

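A minimal sketch of the name construction described above, assuming prefix, group, and name are joined with `_` and any `.` (the StatsD path separator) is mapped to `_`, which is how the StatsD-sourced names currently appear in Prometheus. `makeInstrumentName` is a hypothetical helper, not an existing rippled function; the exact rule should still be confirmed against `StatsDCollector` and the Collector's exporter output.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: builds an OTel instrument name that matches the
// Prometheus name the StatsD path produces today.
std::string
makeInstrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string out;
    for (std::string const* part : {&prefix, &group, &name})
    {
        if (part->empty())
            continue;  // e.g. no group, or an empty [insight] prefix=
        if (!out.empty())
            out += '_';
        out += *part;
    }
    // Map StatsD-style '.' separators to '_' so the result does not depend on
    // whether the exporter normalizes it.
    for (char& c : out)
    {
        if (c == '.')
            c = '_';
    }
    return out;
}
```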
**Key consideration**: OTLP-exported metrics may be normalized differently than metrics arriving through the StatsD receiver. Test this early (Task 7.2) and adjust the naming strategy if needed. In this pipeline the SDK exports OTLP and the Collector's `prometheus` exporter produces the final names; it typically adds a `_total` suffix to monotonic counters and converts dots to underscores, so match existing conventions.

---

## Task 7.6: Update Grafana Dashboards

**Objective**: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.

**What to do**:

- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes are needed.
- If OTLP export produces different names (e.g., `_total` suffix on counters), update:
  - `docker/telemetry/grafana/dashboards/statsd-node-health.json`
  - `docker/telemetry/grafana/dashboards/statsd-network-traffic.json`
  - `docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json`
- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).

**Key modified files**:

- `docker/telemetry/grafana/dashboards/statsd-*.json` (3 files, conditionally)

---

## Task 7.7: Update Integration Tests

**Objective**: Verify the full OTLP metrics pipeline end-to-end.

**What to do**:

- Edit `docker/telemetry/integration-test.sh`:
  - Update test config to use `[insight] server=otel`
  - Verify metrics arrive in Prometheus via OTLP (not StatsD)
  - Add check that StatsD receiver is no longer required
  - Preserve all existing metric presence checks

**Key modified files**:

- `docker/telemetry/integration-test.sh`

---

## Task 7.8: Update Documentation

**Objective**: Update all plan docs, runbook, and reference docs to reflect the migration.

**What to do**:

- Edit `docs/telemetry-runbook.md`:
  - Update `[insight]` config examples to show `server=otel`
  - Update troubleshooting section (no more StatsD UDP debugging)

- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Update Data Flow Overview diagram (remove StatsD receiver)
  - Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
  - Update config examples

- Edit `OpenTelemetryPlan/05-configuration-reference.md`:
  - Add `server=otel` option to `[insight]` section docs

- Edit `docker/telemetry/TESTING.md`:
  - Update setup instructions to use `server=otel`

**Key modified files**:

- `docs/telemetry-runbook.md`
- `OpenTelemetryPlan/09-data-collection-reference.md`
- `OpenTelemetryPlan/05-configuration-reference.md`
- `docker/telemetry/TESTING.md`

---

## Summary Table

| Task | Description | New Files | Modified Files | Depends On |
| ---- | -------------------------------------- | --------- | -------------- | ---------- |
| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | — |
| 7.2 | Implement OTelCollector class | 2 | 0 | 7.1 |
| 7.3 | Update CollectorManager config routing | 0 | 2 | 7.2 |
| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 7.3 |
| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 7.2 |
| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 7.5 |
| 7.7 | Update integration tests | 0 | 1 | 7.4 |
| 7.8 | Update documentation | 0 | 4 | 7.6 |

**Parallel work**: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete (per the table, 7.5 strictly needs only 7.2). Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6 (per the table, 7.7 strictly needs only 7.4).

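To cross-check the parallelism notes against the table, the Depends On column can be treated as a small DAG and each task assigned the earliest "wave" it can start in; tasks that share a wave have no ordering constraint between them. This is an illustrative sketch: `kPhase7Deps` and `wave` are hypothetical names, and the data is copied from the table above.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Depends On column of the Summary Table: task -> prerequisites.
std::map<std::string, std::vector<std::string>> const kPhase7Deps{
    {"7.1", {}},
    {"7.2", {"7.1"}},
    {"7.3", {"7.2"}},
    {"7.4", {"7.3"}},
    {"7.5", {"7.2"}},
    {"7.6", {"7.5"}},
    {"7.7", {"7.4"}},
    {"7.8", {"7.6"}},
};

// Earliest wave a task can start in: 0 with no prerequisites, otherwise one
// more than its latest prerequisite. Assumes the graph has no cycles.
int
wave(std::string const& task, std::map<std::string, int>& memo)
{
    if (auto const it = memo.find(task); it != memo.end())
        return it->second;
    int w = 0;
    for (auto const& dep : kPhase7Deps.at(task))
        w = std::max(w, wave(dep, memo) + 1);
    memo[task] = w;
    return w;
}
```

Under the table's dependencies, 7.3 and 7.5 share wave 2, 7.4 and 7.6 share wave 3, and 7.7 and 7.8 share wave 4.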
**Exit Criteria** (from [06-implementation-phases.md §6.8](./06-implementation-phases.md)):

- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- [ ] `server=otel` is the default in development docker-compose
- [ ] `server=statsd` still works as a fallback
- [ ] Existing Grafana dashboards display data correctly
- [ ] Integration test passes with OTLP-only metrics pipeline
- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead)
- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant — Meter mapped to OTel Counter
@@ -78,6 +78,13 @@ include(target_link_modules)
# Level 01
add_module(xrpl beast)
target_link_libraries(xrpl.libxrpl.beast PUBLIC xrpl.imports.main)
# OTelCollector in beast/insight uses OTel Metrics SDK when telemetry is enabled.
if(telemetry)
target_link_libraries(
xrpl.libxrpl.beast
PUBLIC opentelemetry-cpp::opentelemetry-cpp
)
endif()

include(GitInfo)
add_module(xrpl git)

@@ -444,21 +444,21 @@ curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"r
| jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```

### StatsD Metrics (beast::insight)
### System Metrics (beast::insight via OTel native)

rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector
on port 8125. These appear in Prometheus alongside spanmetrics.
rippled's built-in `beast::insight` framework exports metrics natively via OTLP/HTTP to the OTel Collector
on port 4318 (same endpoint as traces). These appear in Prometheus alongside spanmetrics.

Requires `[insight]` config in `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```

Verify StatsD metrics in Prometheus:
Verify system metrics in Prometheus:

```bash
# Ledger age gauge
@@ -477,7 +477,7 @@ curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'
```

Key StatsD metrics (prefix `rippled_`):
Key system metrics (prefix `rippled_`):

| Metric | Type | Source |
| ------------------------------------- | --------- | ----------------------------------------- |
@@ -514,11 +514,11 @@ Pre-configured dashboards (span-derived):
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

Pre-configured dashboards (StatsD):
Pre-configured dashboards (system metrics):

- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue
- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category
- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings
- **Node Health (System Metrics)**: Validated/published ledger age, operating mode, I/O latency, job queue
- **Network Traffic (System Metrics)**: Peer counts, disconnects, overlay traffic by category
- **RPC & Pathfinding (System Metrics)**: RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

@@ -575,7 +575,7 @@ Pre-configured datasources:
service:
pipelines:
metrics:
receivers: [spanmetrics]
receivers: [otlp, spanmetrics]
exporters: [prometheus]
```
3. Verify Prometheus can reach collector:

@@ -26,10 +26,12 @@ services:
command: ["--config=/etc/otel-collector-config.yaml"]
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8125:8125/udp" # StatsD UDP (beast::insight metrics)
- "8889:8889" # Prometheus metrics (spanmetrics + statsd)
- "4318:4318" # OTLP HTTP (traces + native OTel metrics)
- "8889:8889" # Prometheus metrics (spanmetrics + OTLP)
- "13133:13133" # Health check
# StatsD UDP port removed — beast::insight now uses native OTLP.
# Uncomment if using server=statsd fallback:
# - "8125:8125/udp"
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
depends_on:

@@ -2,7 +2,7 @@
"annotations": {
"list": []
},
"description": "Ledger data exchange and object fetch traffic from beast::insight StatsD. Covers ledger sync, node data retrieval, and transaction set exchange. Requires [insight] server=statsd in rippled config.",
"description": "Ledger data exchange and object fetch traffic from beast::insight System Metrics. Covers ledger sync, node data retrieval, and transaction set exchange. Requires [insight] server=otel in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
@@ -30,57 +30,57 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_get_Bytes_In",
"legendFormat": "Ledger Data Get"
"expr": "rippled_ledger_data_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Data Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_share_Bytes_In",
"legendFormat": "Ledger Data Share"
"expr": "rippled_ledger_data_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Data Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In",
"legendFormat": "TX Set Candidate Get"
"expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In",
"legendFormat": "TX Set Candidate Share"
"expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In",
"legendFormat": "TX Node Get"
"expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In",
"legendFormat": "TX Node Share"
"expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Account_State_Node_get_Bytes_In",
"legendFormat": "Account State Node Get"
"expr": "rippled_ledger_data_Account_State_Node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Node Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_data_Account_State_Node_share_Bytes_In",
"legendFormat": "Account State Node Share"
"expr": "rippled_ledger_data_Account_State_Node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Node Share [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -118,57 +118,57 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_share_Bytes_In",
"legendFormat": "Ledger Share In"
"expr": "rippled_ledger_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Share In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_get_Bytes_In",
"legendFormat": "Ledger Get In"
"expr": "rippled_ledger_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In",
"legendFormat": "TX Set Candidate Share"
"expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In",
"legendFormat": "TX Set Candidate Get"
"expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Set Candidate Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_node_share_Bytes_In",
"legendFormat": "TX Node Share"
"expr": "rippled_ledger_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Transaction_node_get_Bytes_In",
"legendFormat": "TX Node Get"
"expr": "rippled_ledger_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Account_State_node_share_Bytes_In",
"legendFormat": "Account State Share"
"expr": "rippled_ledger_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ledger_Account_State_node_get_Bytes_In",
"legendFormat": "Account State Get"
"expr": "rippled_ledger_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -206,57 +206,57 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_get_Bytes_In",
"legendFormat": "Ledger Get"
"expr": "rippled_getobject_Ledger_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_share_Bytes_In",
"legendFormat": "Ledger Share"
"expr": "rippled_getobject_Ledger_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_get_Bytes_In",
"legendFormat": "Transaction Get"
"expr": "rippled_getobject_Transaction_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_share_Bytes_In",
"legendFormat": "Transaction Share"
"expr": "rippled_getobject_Transaction_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_get_Bytes_In",
"legendFormat": "TX Node Get"
"expr": "rippled_getobject_Transaction_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_share_Bytes_In",
"legendFormat": "TX Node Share"
"expr": "rippled_getobject_Transaction_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_get_Bytes_In",
"legendFormat": "Account State Get"
"expr": "rippled_getobject_Account_State_node_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_share_Bytes_In",
"legendFormat": "Account State Share"
"expr": "rippled_getobject_Account_State_node_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Share [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -294,50 +294,50 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_get_Bytes_In",
"legendFormat": "CAS Get"
"expr": "rippled_getobject_CAS_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_share_Bytes_In",
"legendFormat": "CAS Share"
"expr": "rippled_getobject_CAS_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_share_Bytes_In",
"legendFormat": "Fetch Pack Share"
"expr": "rippled_getobject_Fetch_Pack_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Share [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_get_Bytes_In",
"legendFormat": "Fetch Pack Get"
"expr": "rippled_getobject_Fetch_Pack_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transactions_get_Bytes_In",
"legendFormat": "Transactions Get"
"expr": "rippled_getobject_Transactions_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Transactions Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_get_Bytes_In",
"legendFormat": "Aggregate Get"
"expr": "rippled_getobject_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Aggregate Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_share_Bytes_In",
"legendFormat": "Aggregate Share"
"expr": "rippled_getobject_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Aggregate Share [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -375,55 +375,55 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Ledger_get_Messages_In",
"legendFormat": "Ledger Get"
"expr": "rippled_getobject_Ledger_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Ledger Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_get_Messages_In",
"legendFormat": "Transaction Get"
"expr": "rippled_getobject_Transaction_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Transaction Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transaction_node_get_Messages_In",
"legendFormat": "TX Node Get"
"expr": "rippled_getobject_Transaction_node_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Node Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Account_State_node_get_Messages_In",
"legendFormat": "Account State Get"
"expr": "rippled_getobject_Account_State_node_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Account State Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_CAS_get_Messages_In",
"legendFormat": "CAS Get"
"expr": "rippled_getobject_CAS_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "CAS Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Fetch_Pack_get_Messages_In",
"legendFormat": "Fetch Pack Get"
"expr": "rippled_getobject_Fetch_Pack_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Fetch Pack Get [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_getobject_Transactions_get_Messages_In",
"legendFormat": "Transactions Get"
"expr": "rippled_getobject_Transactions_get_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Transactions Get [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages In",
"spanNulls": true,
@@ -463,8 +463,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "topk(20, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
"legendFormat": "{{__name__}}"
"expr": "topk(20, {exported_instance=~\"$node\", __name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
"legendFormat": "{{__name__}} [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -495,12 +495,33 @@
"schemaVersion": 39,
"tags": ["rippled", "statsd", "ledger", "sync", "telemetry"],
"templating": {
"list": []
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(rippled_ledger_data_get_Bytes_In, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Ledger Data & Sync (StatsD)",
"uid": "rippled-statsd-ledger-sync"
"title": "Ledger Data & Sync (System Metrics)",
"uid": "rippled-system-ledger-sync"
}
@@ -2,7 +2,7 @@
"annotations": {
"list": []
},
"description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"description": "Network traffic and peer metrics from beast::insight System Metrics. Requires [insight] server=otel in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
@@ -30,20 +30,20 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Peer_Finder_Active_Inbound_Peers",
"legendFormat": "Inbound Peers"
"expr": "rippled_Peer_Finder_Active_Inbound_Peers{exported_instance=~\"$node\"}",
"legendFormat": "Inbound Peers [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Peer_Finder_Active_Outbound_Peers",
"legendFormat": "Outbound Peers"
"expr": "rippled_Peer_Finder_Active_Outbound_Peers{exported_instance=~\"$node\"}",
"legendFormat": "Outbound Peers [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Peers",
"spanNulls": true,
@@ -76,13 +76,13 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_Overlay_Peer_Disconnects",
"legendFormat": "Disconnects"
"expr": "rippled_Overlay_Peer_Disconnects{exported_instance=~\"$node\"}",
"legendFormat": "Disconnects [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Disconnects",
"spanNulls": true,
@@ -115,15 +115,15 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Bytes_In",
"legendFormat": "Bytes In"
"expr": "rippled_total_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Bytes_Out",
"legendFormat": "Bytes Out"
"expr": "rippled_total_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Bytes Out [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -161,20 +161,20 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Messages_In",
"legendFormat": "Messages In"
"expr": "rippled_total_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Messages In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_total_Messages_Out",
"legendFormat": "Messages Out"
"expr": "rippled_total_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Messages Out [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
@@ -207,27 +207,27 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_Messages_In",
"legendFormat": "TX Messages In"
"expr": "rippled_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Messages In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_Messages_Out",
"legendFormat": "TX Messages Out"
"expr": "rippled_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "TX Messages Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_transactions_duplicate_Messages_In",
"legendFormat": "TX Duplicate In"
"expr": "rippled_transactions_duplicate_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "TX Duplicate In [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
@@ -260,34 +260,34 @@
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_proposals_Messages_In",
|
||||
"legendFormat": "Proposals In"
|
||||
"expr": "rippled_proposals_Messages_In{exported_instance=~\"$node\"}",
|
||||
"legendFormat": "Proposals In [{{exported_instance}}]"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_proposals_Messages_Out",
|
||||
"legendFormat": "Proposals Out"
|
||||
"expr": "rippled_proposals_Messages_Out{exported_instance=~\"$node\"}",
|
||||
"legendFormat": "Proposals Out [{{exported_instance}}]"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_proposals_untrusted_Messages_In",
|
||||
"legendFormat": "Untrusted In"
|
||||
"expr": "rippled_proposals_untrusted_Messages_In{exported_instance=~\"$node\"}",
|
||||
"legendFormat": "Untrusted In [{{exported_instance}}]"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_proposals_duplicate_Messages_In",
|
||||
"legendFormat": "Duplicate In"
|
||||
"expr": "rippled_proposals_duplicate_Messages_In{exported_instance=~\"$node\"}",
|
||||
"legendFormat": "Duplicate In [{{exported_instance}}]"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"unit": "none",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
@@ -320,34 +320,34 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_Messages_In",
"legendFormat": "Validations In"
"expr": "rippled_validations_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Validations In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_Messages_Out",
"legendFormat": "Validations Out"
"expr": "rippled_validations_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Validations Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_untrusted_Messages_In",
"legendFormat": "Untrusted In"
"expr": "rippled_validations_untrusted_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Untrusted In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validations_duplicate_Messages_In",
"legendFormat": "Duplicate In"
"expr": "rippled_validations_duplicate_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Duplicate In [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
@@ -380,8 +380,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
"legendFormat": "{{__name__}}"
"expr": "topk(10, {exported_instance=~\"$node\", __name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
"legendFormat": "{{__name__}} [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -660,12 +660,33 @@
"schemaVersion": 39,
"tags": ["rippled", "statsd", "network", "telemetry"],
"templating": {
"list": []
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(rippled_Peer_Finder_Active_Inbound_Peers, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Network Traffic (StatsD)",
"uid": "rippled-statsd-network"
"title": "Network Traffic (System Metrics)",
"uid": "rippled-system-network"
}
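Every query above now goes through the new `node` template variable as `{exported_instance=~\"$node\"}`. The sketch below approximates the string Grafana substitutes for `$node` and how Prometheus then evaluates the `=~` matcher. It rests on a few assumptions: Grafana regex-escapes each selected value and joins multiple selections into a `(a|b)` alternation for Prometheus data sources, and the "All" option substitutes the configured `allValue` (`.*`) as-is. `node_matcher_value` and `label_matches` are hypothetical helpers written for illustration; they are not Grafana or Prometheus APIs. Prometheus anchors regex matchers at both ends, and the sketch models that with `re.fullmatch`.

```python
import re


def node_matcher_value(selected, all_selected, all_value=".*"):
    """Approximate the text substituted for $node (hypothetical helper).

    selected: list of exported_instance values picked in the dropdown.
    all_selected: True when the "All" option is chosen.
    """
    if all_selected:
        # "includeAll": true with "allValue": ".*" inserts the raw regex.
        return all_value
    # Escape regex metacharacters in each id, then join as an alternation.
    escaped = [re.escape(value) for value in selected]
    return "(" + "|".join(escaped) + ")"


def label_matches(matcher_value, label_value):
    """PromQL =~ matchers are fully anchored, so use fullmatch, not search."""
    return re.fullmatch(matcher_value, label_value) is not None


if __name__ == "__main__":
    pattern = node_matcher_value(["node-a", "node-b"], all_selected=False)
    for candidate in ["node-a", "node-ab", "node-c"]:
        print(candidate, label_matches(pattern, candidate))
```

The escaping matters when instance ids contain regex metacharacters. Unescaped, an id such as `10.0.0.1` would also match `10x0x0x1`. The anchoring means selecting `node-a` does not pull in `node-ab`.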
@@ -2,7 +2,7 @@
"annotations": {
"list": []
},
"description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"description": "Node health metrics from beast::insight System Metrics. Requires [insight] server=otel in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
@@ -30,8 +30,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Validated_Ledger_Age",
"legendFormat": "Validated Age"
"expr": "rippled_LedgerMaster_Validated_Ledger_Age{exported_instance=~\"$node\"}",
"legendFormat": "Validated Age [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -78,8 +78,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Published_Ledger_Age",
"legendFormat": "Published Age"
"expr": "rippled_LedgerMaster_Published_Ledger_Age{exported_instance=~\"$node\"}",
"legendFormat": "Published Age [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -107,7 +107,7 @@
},
{
"title": "Operating Mode Duration",
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). A healthy node should spend the vast majority of time in Full mode.",
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778) which report microseconds. A healthy node should spend the vast majority of time in Full mode.",
"type": "timeseries",
"gridPos": {
"h": 8,
@@ -126,43 +126,43 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_duration",
"legendFormat": "Full"
"expr": "rippled_State_Accounting_Full_duration{exported_instance=~\"$node\"}",
"legendFormat": "Full [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_duration",
"legendFormat": "Tracking"
"expr": "rippled_State_Accounting_Tracking_duration{exported_instance=~\"$node\"}",
"legendFormat": "Tracking [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_duration",
"legendFormat": "Syncing"
"expr": "rippled_State_Accounting_Syncing_duration{exported_instance=~\"$node\"}",
"legendFormat": "Syncing [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_duration",
"legendFormat": "Connected"
"expr": "rippled_State_Accounting_Connected_duration{exported_instance=~\"$node\"}",
"legendFormat": "Connected [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_duration",
"legendFormat": "Disconnected"
"expr": "rippled_State_Accounting_Disconnected_duration{exported_instance=~\"$node\"}",
"legendFormat": "Disconnected [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"unit": "µs",
"custom": {
"axisLabel": "Duration (Sec)",
"axisLabel": "Duration",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
@@ -193,41 +193,41 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_transitions",
"legendFormat": "Full"
"expr": "rippled_State_Accounting_Full_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Full [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_transitions",
"legendFormat": "Tracking"
"expr": "rippled_State_Accounting_Tracking_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Tracking [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_transitions",
"legendFormat": "Syncing"
"expr": "rippled_State_Accounting_Syncing_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Syncing [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_transitions",
"legendFormat": "Connected"
"expr": "rippled_State_Accounting_Connected_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Connected [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_transitions",
"legendFormat": "Disconnected"
"expr": "rippled_State_Accounting_Disconnected_transitions{exported_instance=~\"$node\"}",
"legendFormat": "Disconnected [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Transitions",
"spanNulls": true,
@@ -260,15 +260,15 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{quantile=\"0.95\"}",
"legendFormat": "P95 I/O Latency"
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_ios_latency_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P95 I/O Latency [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{quantile=\"0.5\"}",
"legendFormat": "P50 I/O Latency"
"expr": "histogram_quantile(0.50, sum by (le, exported_instance) (rate(rippled_ios_latency_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P50 I/O Latency [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -287,7 +287,7 @@
},
{
"title": "Job Queue Depth",
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough — common during ledger replay or heavy RPC load.",
"type": "timeseries",
"gridPos": {
"h": 8,
@@ -306,13 +306,13 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_job_count",
"legendFormat": "Job Queue Depth"
"expr": "rippled_job_count{exported_instance=~\"$node\"}",
"legendFormat": "Job Queue Depth [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Jobs",
"spanNulls": true,
@@ -345,8 +345,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_ledger_fetches_total[5m])",
"legendFormat": "Fetches / Sec"
"expr": "rate(rippled_ledger_fetches_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Fetches / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -377,8 +377,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_ledger_history_mismatch_total[5m])",
"legendFormat": "Mismatches / Sec"
"expr": "rate(rippled_ledger_history_mismatch_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Mismatches / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -404,12 +404,33 @@
"schemaVersion": 39,
"tags": ["rippled", "statsd", "node-health", "telemetry"],
"templating": {
"list": []
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(rippled_LedgerMaster_Validated_Ledger_Age, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Node Health (StatsD)",
"uid": "rippled-statsd-node-health"
"title": "Node Health (System Metrics)",
"uid": "rippled-system-node-health"
}
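The I/O latency panels above no longer read pre-computed summary quantiles (`rippled_ios_latency{quantile=\"0.95\"}`). They now compute quantiles from histogram buckets with `histogram_quantile`. Bucket counts can be summed across series, but pre-computed quantiles cannot, and that is why the query keeps `le` in its `sum by` clause. The sketch below is a simplified re-implementation of Prometheus's per-bucket linear interpolation. It assumes sorted, non-negative bucket bounds ending in `+Inf`, cumulative counts, and `0 < q <= 1`. It also omits edge cases the real function handles, such as negative bounds, non-monotonic counts, and native histograms. The function is a hypothetical stand-in, not the Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile() over one series' buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last entry being (math.inf, total_count).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first finite bucket whose cumulative count reaches the rank;
    # falls back to the +Inf bucket's index if none does.
    b = next(
        (i for i, (_, count) in enumerate(buckets[:-1]) if count >= rank),
        len(buckets) - 1,
    )
    if b == len(buckets) - 1:
        # The rank lands in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # Lower edge of the bucket is 0 for the first bucket, else the previous bound.
    start, prev_count = (0.0, 0.0) if b == 0 else buckets[b - 1]
    end, count = buckets[b]
    # Linear interpolation of the rank's position inside the bucket.
    return start + (end - start) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    example = [(1.0, 10.0), (2.0, 20.0), (4.0, 40.0), (math.inf, 40.0)]
    for quantile in (0.5, 0.95):
        print(quantile, histogram_quantile(quantile, example))
```

The result is interpolated linearly within a bucket, so its precision depends on the bucket boundaries the exporter chooses. A quantile that falls in the `+Inf` bucket is clamped to the highest finite bound.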
@@ -2,7 +2,7 @@
"annotations": {
"list": []
},
"description": "Detailed overlay traffic breakdown for categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. Requires [insight] server=statsd in rippled config.",
"description": "Detailed overlay traffic breakdown for categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. Requires [insight] server=otel in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
@@ -30,48 +30,48 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_Messages_In",
"legendFormat": "Squelch In"
"expr": "rippled_squelch_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Squelch In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_Messages_Out",
"legendFormat": "Squelch Out"
"expr": "rippled_squelch_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Squelch Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_suppressed_Messages_In",
"legendFormat": "Suppressed In"
"expr": "rippled_squelch_suppressed_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Suppressed In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_suppressed_Messages_Out",
"legendFormat": "Suppressed Out"
"expr": "rippled_squelch_suppressed_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Suppressed Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_ignored_Messages_In",
"legendFormat": "Ignored In"
"expr": "rippled_squelch_ignored_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Ignored In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_squelch_ignored_Messages_Out",
"legendFormat": "Ignored Out"
"expr": "rippled_squelch_ignored_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Ignored Out [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
@@ -104,43 +104,43 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_Bytes_In",
"legendFormat": "Base Overhead In"
"expr": "rippled_overhead_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Base Overhead In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_Bytes_Out",
"legendFormat": "Base Overhead Out"
"expr": "rippled_overhead_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Base Overhead Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_cluster_Bytes_In",
"legendFormat": "Cluster In"
"expr": "rippled_overhead_cluster_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Cluster In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_cluster_Bytes_Out",
"legendFormat": "Cluster Out"
"expr": "rippled_overhead_cluster_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Cluster Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_manifest_Bytes_In",
"legendFormat": "Manifest In"
"expr": "rippled_overhead_manifest_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Manifest In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_overhead_manifest_Bytes_Out",
"legendFormat": "Manifest Out"
"expr": "rippled_overhead_manifest_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Manifest Out [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -178,34 +178,34 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Bytes_In",
"legendFormat": "Bytes In"
"expr": "rippled_validator_lists_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Bytes_Out",
"legendFormat": "Bytes Out"
"expr": "rippled_validator_lists_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Bytes Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Messages_In",
"legendFormat": "Messages In"
"expr": "rippled_validator_lists_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Messages In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_validator_lists_Messages_Out",
"legendFormat": "Messages Out"
"expr": "rippled_validator_lists_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Messages Out [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Count",
"spanNulls": true,
@@ -255,29 +255,29 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_get_Bytes_In",
"legendFormat": "Set Get In"
"expr": "rippled_set_get_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Set Get In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_get_Bytes_Out",
"legendFormat": "Set Get Out"
"expr": "rippled_set_get_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Set Get Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_share_Bytes_In",
"legendFormat": "Set Share In"
"expr": "rippled_set_share_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Set Share In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_set_share_Bytes_Out",
"legendFormat": "Set Share Out"
"expr": "rippled_set_share_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Set Share Out [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -315,34 +315,34 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_have_transactions_Messages_In",
"legendFormat": "Have TX In"
"expr": "rippled_have_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Have TX In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_have_transactions_Messages_Out",
"legendFormat": "Have TX Out"
"expr": "rippled_have_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Have TX Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_requested_transactions_Messages_In",
"legendFormat": "Requested TX In"
"expr": "rippled_requested_transactions_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Requested TX In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_requested_transactions_Messages_Out",
"legendFormat": "Requested TX Out"
"expr": "rippled_requested_transactions_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Requested TX Out [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
@@ -375,34 +375,34 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Bytes_In",
"legendFormat": "Unknown Bytes In"
"expr": "rippled_unknown_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Bytes_Out",
"legendFormat": "Unknown Bytes Out"
"expr": "rippled_unknown_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Bytes Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Messages_In",
"legendFormat": "Unknown Messages In"
"expr": "rippled_unknown_Messages_In{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Messages In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_unknown_Messages_Out",
"legendFormat": "Unknown Messages Out"
"expr": "rippled_unknown_Messages_Out{exported_instance=~\"$node\"}",
"legendFormat": "Unknown Messages Out [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"unit": "none",
"custom": {
"axisLabel": "Count",
"spanNulls": true,
@@ -452,29 +452,29 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_request_Bytes_In",
"legendFormat": "Request Bytes In"
"expr": "rippled_proof_path_request_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_request_Bytes_Out",
"legendFormat": "Request Bytes Out"
"expr": "rippled_proof_path_request_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_response_Bytes_In",
"legendFormat": "Response Bytes In"
"expr": "rippled_proof_path_response_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_proof_path_response_Bytes_Out",
"legendFormat": "Response Bytes Out"
"expr": "rippled_proof_path_response_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes Out [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -512,29 +512,29 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_request_Bytes_In",
"legendFormat": "Request Bytes In"
"expr": "rippled_replay_delta_request_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_request_Bytes_Out",
"legendFormat": "Request Bytes Out"
"expr": "rippled_replay_delta_request_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Request Bytes Out [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_response_Bytes_In",
"legendFormat": "Response Bytes In"
"expr": "rippled_replay_delta_response_Bytes_In{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes In [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_replay_delta_response_Bytes_Out",
"legendFormat": "Response Bytes Out"
"expr": "rippled_replay_delta_response_Bytes_Out{exported_instance=~\"$node\"}",
"legendFormat": "Response Bytes Out [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -555,12 +555,33 @@
"schemaVersion": 39,
"tags": ["rippled", "statsd", "overlay", "network", "telemetry"],
"templating": {
"list": []
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(rippled_squelch_Messages_In, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "Overlay Traffic Detail (StatsD)",
"uid": "rippled-statsd-overlay-detail"
"title": "Overlay Traffic Detail (System Metrics)",
"uid": "rippled-system-overlay-detail"
}
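Every filter in these dashboards uses `exported_instance`, not `instance`. Prometheus produces that label name when a scraped series already carries a label that collides with one of the scrape target's own labels and `honor_labels` is left at its default of `false`. In that case the exposed value moves to `exported_<name>` and the target's value takes its place. The following assumes the collector's Prometheus exporter exposes `service.instance.id` as `instance`, which the `node` variable's description suggests. Under that assumption, the rippled node identity ends up in `exported_instance`, and `instance` holds the collector's scrape address. The sketch below models this merge. `attach_target_labels` is a hypothetical helper, not a Prometheus API, and the label values in the usage are made up.

```python
def attach_target_labels(scraped, target, honor_labels=False):
    """Sketch of how Prometheus merges target labels into a scraped series.

    scraped: labels exposed on the /metrics endpoint.
    target: labels Prometheus attaches to the scrape target (job, instance).
    Returns a new dict; the inputs are not modified.
    """
    result = dict(scraped)
    for name, value in target.items():
        if name not in result:
            # No collision: the target label is simply added.
            result[name] = value
        elif honor_labels:
            # honor_labels: true keeps the exposed value and drops the target's.
            continue
        else:
            # Default behavior: move the exposed value aside under
            # exported_<name>, prefixing again if that name is also taken.
            new_name = "exported_" + name
            while new_name in result:
                new_name = "exported_" + new_name
            result[new_name] = result[name]
            result[name] = value
    return result


if __name__ == "__main__":
    series = {"__name__": "rippled_squelch_Messages_In", "instance": "node-1"}
    scrape_target = {"instance": "otel-collector:8889", "job": "otel-collector"}
    print(attach_target_labels(series, scrape_target))
```

Setting `honor_labels: true` on that scrape job would keep the node identity in `instance`. The dashboard filters would then need to target `instance` rather than `exported_instance`.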
@@ -2,7 +2,7 @@
"annotations": {
"list": []
},
"description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"description": "RPC and pathfinding metrics from beast::insight System Metrics. Requires [insight] server=otel in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
@@ -10,7 +10,7 @@
"links": [],
"panels": [
{
"title": "RPC Request Rate (StatsD)",
"title": "RPC Request Rate (System Metrics)",
"description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.",
"type": "stat",
"gridPos": {
@@ -30,8 +30,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_rpc_requests_total[5m])",
"legendFormat": "Requests / Sec"
"expr": "rate(rippled_rpc_requests_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Requests / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -42,7 +42,7 @@
}
},
{
"title": "RPC Response Time (StatsD)",
"title": "RPC Response Time (System Metrics)",
"description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. Compare with span-based rpc.request duration.",
"type": "timeseries",
"gridPos": {
@@ -62,15 +62,15 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
"legendFormat": "P95 Response Time"
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P95 Response Time [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
"legendFormat": "P50 Response Time"
"expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P50 Response Time [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -108,15 +108,15 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{quantile=\"0.95\"}",
"legendFormat": "P95 Response Size"
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_size_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P95 Response Size [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{quantile=\"0.5\"}",
"legendFormat": "P50 Response Size"
"expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_size_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P50 Response Size [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -154,29 +154,29 @@
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
"legendFormat": "P50"
"expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P50 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.9\"}",
"legendFormat": "P90"
"expr": "histogram_quantile(0.9, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P90 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
"legendFormat": "P95"
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P95 [{{exported_instance}}]"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.99\"}",
"legendFormat": "P99"
"expr": "histogram_quantile(0.99, sum by (le, exported_instance) (rate(rippled_rpc_time_bucket{exported_instance=~\"$node\"}[5m])))",
"legendFormat": "P99 [{{exported_instance}}]"
}
],
"fieldConfig": {
|
||||
@@ -214,15 +214,15 @@
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Fast Pathfind"
|
||||
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_pathfind_fast_bucket{exported_instance=~\"$node\"}[5m])))",
|
||||
"legendFormat": "P95 Fast Pathfind [{{exported_instance}}]"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Fast Pathfind"
|
||||
"expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_pathfind_fast_bucket{exported_instance=~\"$node\"}[5m])))",
|
||||
"legendFormat": "P50 Fast Pathfind [{{exported_instance}}]"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
@@ -260,15 +260,15 @@
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Full Pathfind"
|
||||
"expr": "histogram_quantile(0.95, sum by (le, exported_instance) (rate(rippled_pathfind_full_bucket{exported_instance=~\"$node\"}[5m])))",
|
||||
"legendFormat": "P95 Full Pathfind [{{exported_instance}}]"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Full Pathfind"
|
||||
"expr": "histogram_quantile(0.5, sum by (le, exported_instance) (rate(rippled_pathfind_full_bucket{exported_instance=~\"$node\"}[5m])))",
|
||||
"legendFormat": "P50 Full Pathfind [{{exported_instance}}]"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
@@ -287,7 +287,7 @@
|
||||
},
|
||||
{
|
||||
"title": "Resource Warnings Rate",
|
||||
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: With the legacy StatsD backend (server=statsd), this panel shows no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
@@ -306,8 +306,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_warn_total[5m])",
"legendFormat": "Warnings / Sec"
"expr": "rate(rippled_warn_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Warnings / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -335,7 +335,7 @@
},
{
"title": "Resource Drops Rate",
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: With the legacy StatsD backend (server=statsd), this panel shows no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
@@ -354,8 +354,8 @@
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_drop_total[5m])",
"legendFormat": "Drops / Sec"
"expr": "rate(rippled_drop_total{exported_instance=~\"$node\"}[5m])",
"legendFormat": "Drops / Sec [{{exported_instance}}]"
}
],
"fieldConfig": {
@@ -385,12 +385,33 @@
"schemaVersion": 39,
"tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"],
"templating": {
"list": []
"list": [
{
"name": "node",
"label": "Node",
"description": "Filter by rippled node (service.instance.id)",
"type": "query",
"query": "label_values(rippled_rpc_requests_total, exported_instance)",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"includeAll": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"multi": true,
"refresh": 2,
"sort": 1
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "RPC & Pathfinding (StatsD)",
"uid": "rippled-statsd-rpc"
"title": "RPC & Pathfinding (System Metrics)",
"uid": "rippled-system-rpc"
}
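The percentile panels above now read `_bucket` series through `histogram_quantile` rather than precomputed summary quantiles. As a rough guide to what that function returns, here is a minimal C++ sketch of the per-bucket linear interpolation it performs. `quantileFromBuckets` is a hypothetical helper (not part of rippled or Prometheus); the bucket bounds are assumed to be the 1 ms to 5 s boundaries from the collector configuration plus `+Inf`, and several Prometheus edge cases (negative bounds, non-monotonic counts, `q` outside [0, 1]) are deliberately not handled.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: approximates Prometheus' histogram_quantile() for one series.
// upperBounds[i] is the "le" bound of bucket i (the last entry is +Inf);
// cumulative[i] is the number of observations <= upperBounds[i].
double quantileFromBuckets(double q,
                           std::vector<double> const& upperBounds,
                           std::vector<double> const& cumulative)
{
    double const nan = std::numeric_limits<double>::quiet_NaN();
    if (upperBounds.size() < 2 || cumulative.size() != upperBounds.size())
        return nan;
    double const total = cumulative.back();  // count in the +Inf bucket
    if (total <= 0)
        return nan;

    double const rank = q * total;  // target position among all observations
    for (std::size_t i = 0; i < upperBounds.size(); ++i)
    {
        if (cumulative[i] < rank)
            continue;
        // Rank lands in the +Inf bucket: report the highest finite bound.
        if (i == upperBounds.size() - 1)
            return upperBounds[i - 1];
        double const lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double const below = (i == 0) ? 0.0 : cumulative[i - 1];
        double const inBucket = cumulative[i] - below;
        if (inBucket <= 0)
            return lower;
        // Assume observations are spread uniformly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
    return upperBounds[upperBounds.size() - 2];
}
```

Because values are interpolated within a bucket, a percentile can only be as precise as the bucket layout: an estimate that lands in the 1000 to 5000 ms bucket carries up to 4 s of uncertainty.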
@@ -312,8 +312,8 @@ trace_peer=1
trace_ledger=1

[insight]
server=statsd
address=127.0.0.1:8125
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled

[rpc_startup]
@@ -539,42 +539,52 @@ else
fi

# ---------------------------------------------------------------------------
# Step 10b: Verify StatsD metrics in Prometheus
# Step 10b: Verify native OTel metrics in Prometheus (beast::insight)
# ---------------------------------------------------------------------------
log ""
log "--- Phase 6: StatsD Metrics (beast::insight) ---"
log "Waiting 20s for StatsD aggregation + Prometheus scrape..."
log "--- Phase 7: Native OTel Metrics (beast::insight via OTLP) ---"
log "Waiting 20s for OTLP metric export + Prometheus scrape..."
sleep 20

check_statsd_metric() {
check_otel_metric() {
local metric_name="$1"
local result
result=$(curl -sf "$PROM/api/v1/query?query=$metric_name" \
| jq '.data.result | length' 2>/dev/null || echo 0)
if [ "$result" -gt 0 ]; then
ok "StatsD: $metric_name ($result series)"
ok "OTel: $metric_name ($result series)"
else
fail "StatsD: $metric_name (0 series)"
fail "OTel: $metric_name (0 series)"
fi
}

# Node health gauges
check_statsd_metric "rippled_LedgerMaster_Validated_Ledger_Age"
check_statsd_metric "rippled_LedgerMaster_Published_Ledger_Age"
check_statsd_metric "rippled_job_count"
# Node health gauges (ObservableGauge — no _total suffix)
check_otel_metric "rippled_LedgerMaster_Validated_Ledger_Age"
check_otel_metric "rippled_LedgerMaster_Published_Ledger_Age"
check_otel_metric "rippled_job_count"

# State accounting
check_statsd_metric "rippled_State_Accounting_Full_duration"
check_otel_metric "rippled_State_Accounting_Full_duration"

# Peer finder
check_statsd_metric "rippled_Peer_Finder_Active_Inbound_Peers"
check_statsd_metric "rippled_Peer_Finder_Active_Outbound_Peers"
check_otel_metric "rippled_Peer_Finder_Active_Inbound_Peers"
check_otel_metric "rippled_Peer_Finder_Active_Outbound_Peers"

# RPC counters (only if RPC was exercised — should be true from Steps 5-8)
check_statsd_metric "rippled_rpc_requests"
# RPC counters (Counter — Prometheus adds _total suffix automatically)
check_otel_metric "rippled_rpc_requests_total"

# Overlay traffic
check_statsd_metric "rippled_total_Bytes_In"
check_otel_metric "rippled_total_Bytes_In"

# Verify StatsD receiver is NOT required (no statsd receiver in pipeline)
log ""
log "--- Verify StatsD receiver is not required ---"
# curl speaks TCP only, so it cannot detect a UDP listener; inspect UDP sockets with ss (iproute2) instead
statsd_port_check=$(ss -lun 2>/dev/null | grep -c ':8125 ' || true)
if [ "${statsd_port_check:-0}" -eq 0 ]; then
ok "StatsD UDP port 8125 is not listening (not required)"
else
fail "StatsD UDP port 8125 appears to be listening (should not be needed)"
fi

# ---------------------------------------------------------------------------
# Step 11: Summary

@@ -2,22 +2,21 @@
#
# Pipelines:
# traces: OTLP receiver -> batch processor -> debug + Jaeger + Tempo + spanmetrics
# metrics: spanmetrics connector + StatsD receiver -> Prometheus exporter
# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter
#
# rippled sends traces via OTLP/HTTP to port 4318. The collector batches
# them, forwards to both Jaeger and Tempo, and derives RED metrics via the
# spanmetrics connector, which Prometheus scrapes on port 8889.
#
# rippled also sends beast::insight metrics via StatsD/UDP to port 8125.
# These are ingested by the statsd receiver and merged into the same
# Prometheus endpoint alongside span-derived metrics.
# rippled sends beast::insight metrics natively via OTLP/HTTP to port 4318
# (same endpoint as traces). The OTLP receiver feeds both the traces and
# metrics pipelines. Metrics are exported to Prometheus alongside
# span-derived metrics.
#
# TODO: The Resource Manager's "warn" and "drop" metrics use the non-standard
# "|m" (meter) StatsD type in StatsDCollector.cpp:706. The OTel StatsD
# receiver silently drops "|m" metrics since it only recognizes standard
# types (|c, |g, |ms, |h, |s). To capture these two metrics, change "|m"
# to "|c" in StatsDCollector.cpp — this is a breaking change for any
# backend that relied on the custom "|m" type. Tracked as Phase 6 Task 6.1.
# For backward compatibility, the StatsD receiver config is preserved below
# but commented out. If you need StatsD fallback (server=statsd in
# [insight]), uncomment the statsd receiver and add it to the metrics
# pipeline receivers list.

receivers:
otlp:
@@ -26,20 +25,22 @@ receivers:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
statsd:
endpoint: "0.0.0.0:8125"
aggregation_interval: 15s
enable_metric_type: true
is_monotonic_counter: true
timer_histogram_mapping:
- statsd_type: "timing"
observer_type: "summary"
summary:
percentiles: [0, 50, 90, 95, 99, 100]
- statsd_type: "histogram"
observer_type: "summary"
summary:
percentiles: [0, 50, 90, 95, 99, 100]
# StatsD receiver — kept for backward compatibility with server=statsd.
# Uncomment and add "statsd" to metrics pipeline receivers if needed.
# statsd:
# endpoint: "0.0.0.0:8125"
# aggregation_interval: 15s
# enable_metric_type: true
# is_monotonic_counter: true
# timer_histogram_mapping:
# - statsd_type: "timing"
# observer_type: "summary"
# summary:
# percentiles: [0, 50, 90, 95, 99, 100]
# - statsd_type: "histogram"
# observer_type: "summary"
# summary:
# percentiles: [0, 50, 90, 95, 99, 100]

processors:
batch:
@@ -84,5 +85,6 @@ service:
processors: [batch]
exporters: [debug, otlp/jaeger, otlp/tempo, spanmetrics]
metrics:
receivers: [spanmetrics, statsd]
receivers: [otlp, spanmetrics]
processors: [batch]
exporters: [prometheus]

@@ -161,14 +161,27 @@ Configured in `otel-collector-config.yaml`:
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```

## StatsD Metrics (beast::insight)
## System Metrics (beast::insight via OTel native)

rippled has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.
rippled has a built-in metrics framework (`beast::insight`) that exports metrics natively via OTLP/HTTP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.

### Configuration

Add to `xrpld.cfg`:

```ini
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```

The OTel Collector receives these via the OTLP receiver (same endpoint as traces, port 4318) and exports them to Prometheus alongside spanmetrics.
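Prometheus does not see the OTel instrument names verbatim: the collector's Prometheus exporter sanitizes them and, for monotonic counters, appends `_total`, which is why the verification script checks `rippled_rpc_requests_total` but `rippled_job_count`. The C++ sketch below models only those two rules. `toPrometheusName` is hypothetical, and real exporter versions may also add unit suffixes or collapse repeated underscores depending on configuration, so treat it as an approximation rather than the exporter's exact behavior.

```cpp
#include <string>

// Hypothetical helper (not part of rippled or the OTel Collector): a simplified
// model of how an OTel metric name is rewritten for Prometheus.
std::string toPrometheusName(std::string const& otelName, bool monotonicCounter)
{
    std::string out;
    out.reserve(otelName.size() + 6);
    for (char c : otelName)
    {
        bool const valid = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') || c == '_' || c == ':';
        out += valid ? c : '_';  // e.g. '.' in "prefix.group.name" becomes '_'
    }
    // Monotonic counters get a "_total" suffix unless they already end with one.
    std::string const suffix = "_total";
    bool const hasSuffix = out.size() >= suffix.size() &&
        out.compare(out.size() - suffix.size(), suffix.size(), suffix) == 0;
    if (monotonicCounter && !hasSuffix)
        out += suffix;
    return out;
}
```

Under these assumptions, a counter named `rippled.warn` surfaces as `rippled_warn_total`, matching the expression used by the Resource Warnings panel.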

#### StatsD fallback (backward compatibility)

The legacy StatsD backend is still available:

```ini
[insight]
server=statsd
@@ -176,7 +189,7 @@ address=127.0.0.1:8125
prefix=rippled
```

The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics.
When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yaml` and add port `8125:8125/udp` to the docker-compose otel-collector service.

### Metric Reference

@@ -284,7 +297,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
| Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` |
| Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |

### Node Health — StatsD (`rippled-statsd-node-health`)
### Node Health — System Metrics (`rippled-system-node-health`)

| Panel | Type | PromQL | Labels Used |
| -------------------------- | ---------- | ------------------------------------------------------ | ----------- |
@@ -297,7 +310,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — |
| Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — |

### Network Traffic — StatsD (`rippled-statsd-network`)
### Network Traffic — System Metrics (`rippled-system-network`)

| Panel | Type | PromQL | Labels Used |
| ---------------------- | ---------- | -------------------------------------- | ----------- |
@@ -310,7 +323,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
| Validation Traffic | timeseries | `rippled_validations_Messages_In/Out` | — |
| Traffic by Category | bargauge | `topk(10, rippled_*_Bytes_In)` | — |

### RPC & Pathfinding — StatsD (`rippled-statsd-rpc`)
### RPC & Pathfinding — System Metrics (`rippled-system-rpc`)

| Panel | Type | PromQL | Labels Used |
| ------------------------- | ---------- | -------------------------------------------------------- | ----------- |
@@ -354,6 +367,14 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces`
4. Check collector logs: `docker compose logs otel-collector`

### No system metrics in Prometheus

1. Check rippled logs for `OTelCollector starting` message
2. Verify `server=otel` in the `[insight]` config section
3. Verify the endpoint in `[insight]` points to the OTLP/HTTP port (default: `http://localhost:4318/v1/metrics`)
4. Check that the `otlp` receiver is in the metrics pipeline receivers in `otel-collector-config.yaml`
5. Query Prometheus directly: `curl 'http://localhost:9090/api/v1/query?query=rippled_job_count'`

### High memory usage

- Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)

@@ -12,4 +12,5 @@
#include <xrpl/beast/insight/Hook.h>
#include <xrpl/beast/insight/HookImpl.h>
#include <xrpl/beast/insight/NullCollector.h>
#include <xrpl/beast/insight/OTelCollector.h>
#include <xrpl/beast/insight/StatsDCollector.h>

92
include/xrpl/beast/insight/OTelCollector.h
Normal file
@@ -0,0 +1,92 @@
#pragma once

/**
* @file OTelCollector.h
* @brief OpenTelemetry-based implementation of the beast::insight::Collector
* interface for native OTLP metric export.
*
* When XRPL_ENABLE_TELEMETRY is defined, OTelCollector maps each
* beast::insight instrument type (Counter, Gauge, Event, Meter, Hook) to
* the corresponding OpenTelemetry Metrics SDK instrument and exports
* them via OTLP/HTTP to an OpenTelemetry Collector.
*
* When XRPL_ENABLE_TELEMETRY is NOT defined, OTelCollector::New() returns
* a NullCollector so the binary compiles without OTel dependencies.
*
* Dependency diagram:
*
*   +-----------------+     +-------------------+
*   | Collector (ABC) |<----| OTelCollector     |
*   +-----------------+     | (public header)   |
*            ^              +-------------------+
*            |                        |
*   +-----------------+     +-------------------+
*   | NullCollector   |     | OTelCollectorImp  |
*   | (fallback when  |     | (impl in .cpp,    |
*   |  no telemetry)  |     |  uses OTel SDK)   |
*   +-----------------+     +-------------------+
*                                     |
*                           +-------------------+
*                           | OTel Metrics SDK  |
*                           | MeterProvider     |
*                           | OTLP HTTP Metric  |
*                           | Exporter          |
*                           +-------------------+
*/

#include <xrpl/beast/insight/Collector.h>
#include <xrpl/beast/utility/Journal.h>

#include <memory>
#include <string>

namespace beast {
namespace insight {

/**
* @brief A Collector that exports metrics via OpenTelemetry OTLP/HTTP.
*
* Replaces StatsD-based metric collection with native OTel Metrics SDK
* instruments. Each beast::insight instrument maps to an OTel equivalent:
*
* - Counter -> OTel Counter<uint64_t>
* - Gauge -> OTel ObservableGauge<int64_t> (async callback)
* - Event -> OTel Histogram<double> (duration in milliseconds)
* - Meter -> OTel Counter<uint64_t> (monotonic, unsigned)
* - Hook -> Called by PeriodicMetricReader at collection time
*
* @see StatsDCollector for the StatsD-based alternative.
* @see NullCollector for the no-op fallback.
*/
class OTelCollector : public Collector
{
public:
explicit OTelCollector() = default;

/**
* @brief Factory method to create an OTelCollector instance.
*
* When XRPL_ENABLE_TELEMETRY is defined, creates a real OTel-backed
* collector that exports metrics via OTLP/HTTP. When telemetry is
* disabled at compile time, returns a NullCollector.
*
* @param endpoint OTLP/HTTP metrics endpoint URL
* (e.g. "http://localhost:4318/v1/metrics").
* @param prefix Prefix prepended to all metric names
* (e.g. "rippled").
* @param instanceId Unique identifier for this node instance,
* emitted as the `service.instance.id` OTel
* resource attribute. Pass an empty string to
* omit the attribute.
* @param journal Journal for logging.
* @return Shared pointer to the created Collector.
*/
static std::shared_ptr<Collector>
New(std::string const& endpoint,
std::string const& prefix,
std::string const& instanceId,
Journal journal);
};

} // namespace insight
} // namespace beast
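The mapping in this header turns beast::insight's push-style `Gauge` into OTel's pull-style `ObservableGauge`. The following dependency-free sketch shows the shape of that bridge under stated assumptions: `BridgedGauge` is a hypothetical stand-in for `OTelGaugeImpl`, `observe()` plays the role of the SDK's async callback, and clamping to `[0, INT64_MAX]` is assumed because the value is stored in a signed `std::atomic<int64_t>` (the implementation's exact clamping is not shown in this header).

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for OTelGaugeImpl, without any OTel SDK dependency.
class BridgedGauge
{
public:
    // Push-style API used by metric producers: store an absolute value.
    void set(std::uint64_t value)
    {
        std::uint64_t constexpr maxStored = std::numeric_limits<std::int64_t>::max();
        m_value.store(static_cast<std::int64_t>(value > maxStored ? maxStored : value),
                      std::memory_order_relaxed);
    }

    // Signed adjustment, clamped to [0, INT64_MAX] (assumed policy).
    void increment(std::int64_t amount)
    {
        std::int64_t constexpr maxValue = std::numeric_limits<std::int64_t>::max();
        std::int64_t current = m_value.load(std::memory_order_relaxed);
        std::int64_t next;
        do
        {
            if (amount < 0)
                next = (current + amount < 0) ? 0 : current + amount;  // current >= 0, no overflow
            else
                next = (current > maxValue - amount) ? maxValue : current + amount;
        } while (!m_value.compare_exchange_weak(current, next, std::memory_order_relaxed));
    }

    // Pull-style read, as an ObservableGauge callback would perform at export time.
    std::int64_t observe() const { return m_value.load(std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> m_value{0};
};
```

Writers never block the export thread: `set()` is a single store, and `increment()` retries its compare-exchange only when another writer updated the value concurrently.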
879
src/libxrpl/beast/insight/OTelCollector.cpp
Normal file
@@ -0,0 +1,879 @@
/**
* @file OTelCollector.cpp
* @brief OpenTelemetry Metrics SDK implementation of beast::insight::Collector.
*
* Compiled only when XRPL_ENABLE_TELEMETRY is defined (via CMake
* telemetry=ON). Maps beast::insight instruments to OTel SDK instruments
* and exports them via OTLP/HTTP using a PeriodicMetricReader.
*
* When XRPL_ENABLE_TELEMETRY is not defined, OTelCollector::New() returns
* a NullCollector so the build succeeds without OTel dependencies.
*
* Data flow:
*
*                             beast::insight callers
*                                        |
*                                        v
*   OTelCounterImpl     OTelGaugeImpl       OTelEventImpl       OTelMeterImpl
*          |                   |                   |                   |
*          v                   v                   v                   v
*   Counter<uint64>     ObservableGauge     Histogram<double>   Counter<uint64>
*          |                   |                   |                   |
*          +-------------------+-------------------+-------------------+
*                                        |
*                                        v
*                       PeriodicMetricReader (1s interval)
*                                        |
*                                        v
*             OtlpHttpMetricExporter -> OTel Collector -> Prometheus
*/

#ifdef XRPL_ENABLE_TELEMETRY

#include <xrpl/beast/insight/CounterImpl.h>
#include <xrpl/beast/insight/EventImpl.h>
#include <xrpl/beast/insight/GaugeImpl.h>
#include <xrpl/beast/insight/Hook.h>
#include <xrpl/beast/insight/HookImpl.h>
#include <xrpl/beast/insight/MeterImpl.h>
#include <xrpl/beast/insight/OTelCollector.h>
#include <xrpl/beast/utility/Journal.h>

#include <opentelemetry/exporters/otlp/otlp_http_metric_exporter_factory.h>
#include <opentelemetry/exporters/otlp/otlp_http_metric_exporter_options.h>
#include <opentelemetry/metrics/async_instruments.h>
#include <opentelemetry/metrics/meter.h>
#include <opentelemetry/metrics/meter_provider.h>
#include <opentelemetry/metrics/observer_result.h>
#include <opentelemetry/metrics/sync_instruments.h>
#include <opentelemetry/sdk/metrics/export/periodic_exporting_metric_reader_factory.h>
#include <opentelemetry/sdk/metrics/export/periodic_exporting_metric_reader_options.h>
#include <opentelemetry/sdk/metrics/meter_provider.h>
#include <opentelemetry/sdk/metrics/meter_provider_factory.h>
#include <opentelemetry/sdk/metrics/view/instrument_selector_factory.h>
#include <opentelemetry/sdk/metrics/view/meter_selector_factory.h>
#include <opentelemetry/sdk/metrics/view/view_factory.h>
#include <opentelemetry/sdk/resource/semantic_conventions.h>

#include <atomic>
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

namespace beast {
namespace insight {

namespace detail {

namespace metrics_api = opentelemetry::metrics;
namespace metrics_sdk = opentelemetry::sdk::metrics;
namespace otlp_http = opentelemetry::exporter::otlp;
namespace resource = opentelemetry::sdk::resource;

class OTelCollectorImp;

//------------------------------------------------------------------------------

/**
* @brief OTel-backed implementation of beast::insight::HookImpl.
*
* Stores a handler function that is invoked during each periodic
* metric collection cycle. This mirrors the StatsDHookImpl pattern
* where hooks are called at each 1-second timer tick, but here the
* invocation is triggered by the OTel PeriodicMetricReader's
* observable callback mechanism.
*/
class OTelHookImpl : public HookImpl
{
public:
/**
* @param handler Callback invoked at each collection interval.
* @param impl Owning collector (prevents premature destruction).
*/
OTelHookImpl(HandlerType const& handler, std::shared_ptr<OTelCollectorImp> const& impl);

~OTelHookImpl() override;

/**
* @brief Invoke the stored handler.
*
* Called by the collector during observable gauge callbacks to give
* metric producers a chance to update gauge values before export.
*/
void
callHandler();

private:
OTelHookImpl&
operator=(OTelHookImpl const&);

/** Owning collector. Prevents collector destruction while hook alive. */
std::shared_ptr<OTelCollectorImp> m_impl;

/** User-supplied handler called at each collection interval. */
HandlerType m_handler;
};

//------------------------------------------------------------------------------

/**
* @brief OTel-backed implementation of beast::insight::CounterImpl.
*
* Wraps an OTel Counter<uint64_t> instrument. Each increment() call
* is forwarded directly to the OTel counter's Add() method. The
* PeriodicMetricReader collects and exports the accumulated delta.
*
* Thread safety: OTel Counter::Add() is thread-safe by specification.
*/
class OTelCounterImpl : public CounterImpl
{
public:
/**
* @param name Fully-qualified metric name (prefix.group.name).
* @param meter OTel Meter used to create the counter instrument.
*/
OTelCounterImpl(
std::string const& name,
opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter);

~OTelCounterImpl() override = default;

/**
* @brief Add amount to the counter.
* @param amount Value to add (must be non-negative for OTel counters).
*/
void
increment(value_type amount) override;

private:
OTelCounterImpl&
operator=(OTelCounterImpl const&);

/** OTel synchronous counter instrument. */
opentelemetry::nostd::unique_ptr<metrics_api::Counter<uint64_t>> m_counter;
};

//------------------------------------------------------------------------------

/**
* @brief OTel-backed implementation of beast::insight::EventImpl.
*
* Wraps an OTel Histogram<double> instrument. Each notify() call
* records the duration in milliseconds. Uses explicit bucket boundaries
* matching the SpanMetrics connector configuration:
* [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms
*
* Thread safety: OTel Histogram::Record() is thread-safe by specification.
*/
class OTelEventImpl : public EventImpl
{
public:
/**
* @param name Fully-qualified metric name (prefix.group.name).
* @param meter OTel Meter used to create the histogram instrument.
*/
OTelEventImpl(
std::string const& name,
opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter);

~OTelEventImpl() override = default;

/**
* @brief Record a duration measurement.
* @param value Duration in milliseconds.
*/
void
notify(value_type const& value) override;

private:
OTelEventImpl&
operator=(OTelEventImpl const&);

/** OTel histogram instrument for recording durations. */
opentelemetry::nostd::unique_ptr<metrics_api::Histogram<double>> m_histogram;
};

//------------------------------------------------------------------------------

/**
* @brief OTel-backed implementation of beast::insight::GaugeImpl.
*
* Uses an atomic int64_t to store the current gauge value. The OTel SDK
* reads this value via an ObservableGauge async callback during each
* collection cycle. The set() and increment() methods update the
* atomic value without blocking the collection thread.
*
* Design note: OTel gauges are asynchronous (observable) instruments.
* The SDK calls a registered callback to read the value rather than
* accepting push-style updates. We bridge the beast::insight push-style
* API to OTel's pull-style API via the atomic variable.
*
* Thread safety: all access goes through std::atomic<int64_t>, which is
* lock-free on mainstream 64-bit platforms (not guaranteed by the standard).
*/
|
||||
class OTelGaugeImpl : public GaugeImpl
|
||||
{
|
||||
public:
|
||||
/**
|
||||
* @param name Fully-qualified metric name (prefix.group.name).
|
||||
* @param meter OTel Meter used to create the observable gauge.
|
||||
* @param collector Owning collector, used to invoke hooks before reads.
|
||||
*/
|
||||
OTelGaugeImpl(
|
||||
std::string const& name,
|
||||
opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter,
|
||||
std::shared_ptr<OTelCollectorImp> const& collector);
|
||||
|
||||
~OTelGaugeImpl() override;
|
||||
|
||||
/**
|
||||
* @brief Set the gauge to an absolute value.
|
||||
* @param value New gauge value.
|
||||
*/
|
||||
void
|
||||
set(value_type value) override;
|
||||
|
||||
/**
|
||||
* @brief Increment (or decrement) the gauge by a signed amount.
|
||||
*
|
||||
* Clamps the result to [0, UINT64_MAX] to match StatsDGaugeImpl
|
||||
* behavior.
|
||||
*
|
||||
* @param amount Signed amount to add to the current value.
|
||||
*/
|
||||
void
|
||||
increment(difference_type amount) override;
|
||||
|
||||
/**
|
||||
* @brief Return the current gauge value for the OTel callback.
|
||||
* @return The most recently set/incremented value.
|
||||
*/
|
||||
int64_t
|
||||
currentValue() const;
|
||||
|
||||
private:
|
||||
OTelGaugeImpl&
|
||||
operator=(OTelGaugeImpl const&);
|
||||
|
||||
/** Current gauge value, updated atomically by set()/increment(). */
|
||||
std::atomic<int64_t> m_value{0};
|
||||
|
||||
/** OTel observable gauge handle (prevents deregistration). */
|
||||
opentelemetry::nostd::shared_ptr<metrics_api::ObservableInstrument> m_gauge;
|
||||
|
||||
/** Owning collector, used to invoke hooks before reading gauge values. */
|
||||
std::shared_ptr<OTelCollectorImp> m_collector;
|
||||
};
|
||||
|
||||
//------------------------------------------------------------------------------

/**
 * @brief OTel-backed implementation of beast::insight::MeterImpl.
 *
 * Wraps an OTel Counter<uint64_t> instrument. Semantically identical
 * to Counter but uses unsigned values. The OTel SDK accumulates deltas
 * and exports them via the PeriodicExportingMetricReader.
 *
 * Note: In StatsD, Meter used the non-standard "|m" type, which the
 * OTel StatsD receiver silently dropped. With native OTel, Meter
 * values are properly captured as counter deltas.
 *
 * Thread safety: OTel Counter::Add() is thread-safe by specification.
 */
class OTelMeterImpl : public MeterImpl
{
public:
    /**
     * @param name  Fully-qualified metric name, as produced by
     *              OTelCollectorImp::formatName() (e.g. "prefix_group_name").
     * @param meter OTel Meter used to create the counter instrument.
     */
    OTelMeterImpl(
        std::string const& name,
        opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter);

    ~OTelMeterImpl() override = default;

    /**
     * @brief Add amount to the meter.
     * @param amount Value to add (unsigned).
     */
    void
    increment(value_type amount) override;

private:
    OTelMeterImpl&
    operator=(OTelMeterImpl const&);

    /** OTel synchronous counter instrument (unsigned). */
    opentelemetry::nostd::unique_ptr<metrics_api::Counter<uint64_t>> m_counter;
};

//------------------------------------------------------------------------------

/**
 * @brief Main OTel Collector implementation.
 *
 * Creates an OTel MeterProvider with a PeriodicExportingMetricReader that
 * exports metrics via OTLP/HTTP at 1-second intervals. Implements
 * all Collector::make_*() factory methods to create OTel-backed
 * instrument wrappers.
 *
 * Class diagram:
 *
 *   +------------------+      +------------------+
 *   | Collector (ABC)  |<-----| OTelCollector    |
 *   +------------------+      | (public header)  |
 *            ^                +------------------+
 *            |                         ^
 *   +------------------+               |
 *   | OTelCollectorImp |---------------+
 *   +------------------+
 *   | - m_journal      |
 *   | - m_prefix       |
 *   | - m_provider     |      +---------------------+
 *   | - m_otelMeter    |----->| OTel MeterProvider  |
 *   | - m_hooks[]      |      |  + PeriodicReader   |
 *   | - m_gauges[]     |      |  + OtlpHttpExporter |
 *   +------------------+      +---------------------+
 *
 * Lifecycle:
 *   1. Constructor creates the MeterProvider + exporter pipeline.
 *   2. make_*() methods create instruments registered with the provider.
 *   3. The PeriodicExportingMetricReader collects every 1s, invoking the
 *      observable callbacks.
 *   4. Observable callbacks invoke hooks, then read the gauge atomics.
 *   5. Destructor flushes pending exports and shuts down the MeterProvider.
 *
 * Caveats:
 *   - Observable gauge callbacks run on the SDK's internal thread. Hook
 *     handlers must be thread-safe.
 *   - callHooks() runs handlers while holding m_mutex, so a handler must
 *     not create or destroy hooks or gauges on this collector (std::mutex
 *     is not recursive and would deadlock).
 *   - Metric names are formed as "prefix_name" with dots replaced by
 *     underscores to match StatsD->Prometheus naming conventions.
 *   - The OTel Prometheus exporter appends "_total" to counters. The
 *     metric names we register do NOT include this suffix; the exporter
 *     adds it automatically.
 *
 * Example usage:
 * @code
 *   auto collector = OTelCollector::New(
 *       "http://localhost:4318/v1/metrics",  // OTLP/HTTP metrics endpoint
 *       "rippled",                           // metric name prefix
 *       "",                                  // no service.instance.id
 *       journal);
 *   auto counter = collector->make_counter("rpc.requests");
 *   counter.increment(1);
 *   // Metric "rippled_rpc_requests" exported via OTLP every 1s.
 * @endcode
 */
class OTelCollectorImp : public OTelCollector, public std::enable_shared_from_this<OTelCollectorImp>
{
public:
    /**
     * @brief Construct the OTel collector and initialize the export pipeline.
     *
     * @param endpoint   OTLP/HTTP metrics endpoint URL.
     * @param prefix     Prefix for all metric names.
     * @param instanceId Value for the service.instance.id resource attribute.
     *                   When empty, the attribute is omitted.
     * @param journal    Journal for logging.
     */
    OTelCollectorImp(
        std::string const& endpoint,
        std::string const& prefix,
        std::string const& instanceId,
        Journal journal);

    /**
     * @brief Shut down the MeterProvider, flushing any pending exports.
     */
    ~OTelCollectorImp() override;

    /** @name Collector interface implementation */
    /** @{ */
    Hook
    make_hook(HookImpl::HandlerType const& handler) override;

    Counter
    make_counter(std::string const& name) override;

    Event
    make_event(std::string const& name) override;

    Gauge
    make_gauge(std::string const& name) override;

    Meter
    make_meter(std::string const& name) override;
    /** @} */

    /** @name Hook management for observable callbacks */
    /** @{ */

    /**
     * @brief Register a hook for periodic invocation.
     * @param hook Pointer to the hook to register.
     */
    void
    addHook(OTelHookImpl* hook);

    /**
     * @brief Unregister a hook.
     * @param hook Pointer to the hook to unregister.
     */
    void
    removeHook(OTelHookImpl* hook);

    /**
     * @brief Invoke all registered hooks.
     *
     * Called from observable gauge callbacks before reading gauge values,
     * so that hook handlers have a chance to update metrics.
     */
    void
    callHooks();
    /** @} */

    /** @name Gauge registration for observable callbacks */
    /** @{ */

    /**
     * @brief Register a gauge for observable callback reading.
     * @param gauge Pointer to the gauge to register.
     */
    void
    addGauge(OTelGaugeImpl* gauge);

    /**
     * @brief Unregister a gauge.
     * @param gauge Pointer to the gauge to unregister.
     */
    void
    removeGauge(OTelGaugeImpl* gauge);
    /** @} */

    /**
     * @brief Get the OTel Meter instance for creating instruments.
     * @return Shared pointer to the OTel Meter.
     */
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> const&
    otelMeter() const;

    /**
     * @brief Format a metric name with the configured prefix.
     *
     * Replaces dots with underscores to match StatsD->Prometheus naming.
     * Example: prefix="rippled", name="LedgerMaster.Validated_Ledger_Age"
     *          -> "rippled_LedgerMaster_Validated_Ledger_Age"
     *
     * @param name Raw metric name from beast::insight callers.
     * @return Fully-qualified metric name.
     */
    std::string
    formatName(std::string const& name) const;

private:
    /** Journal for log output. */
    Journal m_journal;

    /** Prefix for all metric names (e.g., "rippled"). */
    std::string m_prefix;

    /** OTel SDK MeterProvider owning the export pipeline. RAII lifecycle. */
    std::shared_ptr<metrics_sdk::MeterProvider> m_provider;

    /** OTel Meter used to create all instruments. */
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> m_otelMeter;

    /** Mutex protecting hook and gauge registration lists. */
    std::mutex m_mutex;

    /** Registered hooks called during observable callbacks. */
    std::vector<OTelHookImpl*> m_hooks;

    /**
     * Gauges currently registered with this collector. Bookkeeping only:
     * each gauge's own observable callback reads its value.
     */
    std::vector<OTelGaugeImpl*> m_gauges;

    /**
     * @brief Debounce timestamp for callHooks().
     *
     * Multiple gauge callbacks fire during the same collection cycle.
     * This atomic holds the time hooks were last invoked, in milliseconds
     * since the std::chrono::steady_clock epoch (an arbitrary point, not
     * the Unix epoch). Hooks run at most once per 500ms window, which
     * avoids redundant invocations while still refreshing values on every
     * 1-second collection cycle.
     */
    std::atomic<int64_t> m_lastHookCallMs{0};
};

//==============================================================================
// Implementation
//==============================================================================

//------------------------------------------------------------------------------
// OTelHookImpl
//------------------------------------------------------------------------------

OTelHookImpl::OTelHookImpl(
    HandlerType const& handler,
    std::shared_ptr<OTelCollectorImp> const& impl)
    : m_impl(impl), m_handler(handler)
{
    m_impl->addHook(this);
}

OTelHookImpl::~OTelHookImpl()
{
    m_impl->removeHook(this);
}

void
OTelHookImpl::callHandler()
{
    m_handler();
}

//------------------------------------------------------------------------------
// OTelCounterImpl
//------------------------------------------------------------------------------

OTelCounterImpl::OTelCounterImpl(
    std::string const& name,
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter)
    : m_counter(meter->CreateUInt64Counter(name))
{
}

void
OTelCounterImpl::increment(value_type amount)
{
    // OTel counters require non-negative values, but beast::insight
    // CounterImpl uses a signed int64_t. Negative and zero amounts are
    // therefore dropped; positive amounts are cast to uint64_t.
    if (amount > 0)
        m_counter->Add(static_cast<uint64_t>(amount));
}

//------------------------------------------------------------------------------
// OTelEventImpl
//------------------------------------------------------------------------------

OTelEventImpl::OTelEventImpl(
    std::string const& name,
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter)
    : m_histogram(meter->CreateDoubleHistogram(name, "Duration in ms", "ms"))
{
}

void
OTelEventImpl::notify(value_type const& value)
{
    m_histogram->Record(static_cast<double>(value.count()), opentelemetry::context::Context{});
}

//------------------------------------------------------------------------------
// OTelGaugeImpl
//------------------------------------------------------------------------------

OTelGaugeImpl::OTelGaugeImpl(
    std::string const& name,
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter,
    std::shared_ptr<OTelCollectorImp> const& collector)
    : m_gauge(meter->CreateInt64ObservableGauge(name)), m_collector(collector)
{
    m_collector->addGauge(this);

    // Register the async callback that the SDK calls during collection.
    // Before reading the gauge value, invoke all registered hooks so that
    // hook handlers (e.g. NetworkOPs State_Accounting) have a chance to
    // update gauge values. callHooks() uses a debounce timestamp so hooks
    // run at most once per collection cycle even with many gauges.
    m_gauge->AddCallback(
        [](opentelemetry::metrics::ObserverResult result, void* state) {
            auto* self = static_cast<OTelGaugeImpl*>(state);
            self->m_collector->callHooks();
            if (auto intResult = opentelemetry::nostd::get_if<opentelemetry::nostd::shared_ptr<
                    opentelemetry::metrics::ObserverResultT<int64_t>>>(&result))
            {
                (*intResult)->Observe(self->currentValue());
            }
        },
        this);
}

OTelGaugeImpl::~OTelGaugeImpl()
{
    m_collector->removeGauge(this);
}

void
OTelGaugeImpl::set(value_type value)
{
    // Saturate at INT64_MAX so that values beyond the int64_t storage
    // range do not wrap around to negative numbers.
    constexpr auto maxValue = std::numeric_limits<int64_t>::max();
    m_value.store(
        value > static_cast<value_type>(maxValue) ? maxValue : static_cast<int64_t>(value),
        std::memory_order_relaxed);
}

void
OTelGaugeImpl::increment(difference_type amount)
{
    // Compare-exchange loop that saturates the result to [0, INT64_MAX].
    // m_value is never negative, so only a positive amount can overflow.
    constexpr auto maxValue = std::numeric_limits<int64_t>::max();
    int64_t current = m_value.load(std::memory_order_relaxed);
    int64_t desired;
    do
    {
        if (amount > 0 && current > maxValue - amount)
        {
            // Clamp to INT64_MAX on overflow.
            desired = maxValue;
        }
        else
        {
            desired = current + amount;
            // Clamp to 0 on underflow.
            if (desired < 0)
                desired = 0;
        }
    } while (!m_value.compare_exchange_weak(current, desired, std::memory_order_relaxed));
}

int64_t
OTelGaugeImpl::currentValue() const
{
    return m_value.load(std::memory_order_relaxed);
}

//------------------------------------------------------------------------------
// OTelMeterImpl
//------------------------------------------------------------------------------

OTelMeterImpl::OTelMeterImpl(
    std::string const& name,
    opentelemetry::nostd::shared_ptr<metrics_api::Meter> const& meter)
    : m_counter(meter->CreateUInt64Counter(name))
{
}

void
OTelMeterImpl::increment(value_type amount)
{
    m_counter->Add(amount);
}

//------------------------------------------------------------------------------
// OTelCollectorImp
//------------------------------------------------------------------------------

OTelCollectorImp::OTelCollectorImp(
    std::string const& endpoint,
    std::string const& prefix,
    std::string const& instanceId,
    Journal journal)
    : m_journal(journal), m_prefix(prefix)
{
    if (m_journal.info())
        m_journal.info() << "OTelCollector starting: endpoint=" << endpoint
                         << " prefix=" << m_prefix;

    // Configure OTLP HTTP metric exporter.
    otlp_http::OtlpHttpMetricExporterOptions exporterOpts;
    exporterOpts.url = endpoint;

    auto exporter = otlp_http::OtlpHttpMetricExporterFactory::Create(exporterOpts);

    // Configure periodic metric reader (1-second export interval).
    metrics_sdk::PeriodicExportingMetricReaderOptions readerOpts;
    readerOpts.export_interval_millis = std::chrono::milliseconds(1000);
    readerOpts.export_timeout_millis = std::chrono::milliseconds(500);

    auto reader =
        metrics_sdk::PeriodicExportingMetricReaderFactory::Create(std::move(exporter), readerOpts);

    // Configure resource attributes matching the trace exporter.
    // Include service.instance.id when provided so Prometheus
    // exported_instance labels distinguish multi-node deployments.
    resource::ResourceAttributes attrs;
    attrs[resource::SemanticConventions::kServiceName] = "rippled";
    if (!instanceId.empty())
        attrs[resource::SemanticConventions::kServiceInstanceId] = instanceId;
    auto resourceAttrs = resource::Resource::Create(attrs);

    // Create MeterProvider with resource, then attach the metric reader.
    m_provider = metrics_sdk::MeterProviderFactory::Create(
        std::make_unique<metrics_sdk::ViewRegistry>(), resourceAttrs);
    m_provider->AddMetricReader(std::move(reader));

    // Configure histogram bucket boundaries for Event instruments.
    // These match the SpanMetrics connector buckets for consistency.
    auto histogramSelector = metrics_sdk::InstrumentSelectorFactory::Create(
        metrics_sdk::InstrumentType::kHistogram, "*", "ms");
    auto meterSelector = metrics_sdk::MeterSelectorFactory::Create("rippled_metrics", "", "");
    auto histogramConfig = std::make_shared<metrics_sdk::HistogramAggregationConfig>();
    histogramConfig->boundaries_ =
        std::vector<double>{1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0};
    auto histogramView = metrics_sdk::ViewFactory::Create(
        "default_histogram",
        "Default histogram view with SpanMetrics-compatible buckets",
        "ms",
        metrics_sdk::AggregationType::kHistogram,
        std::move(histogramConfig));

    m_provider->AddView(
        std::move(histogramSelector), std::move(meterSelector), std::move(histogramView));

    // Create the OTel Meter for creating instruments.
    m_otelMeter = m_provider->GetMeter("rippled_metrics", "1.0.0");

    if (m_journal.info())
        m_journal.info() << "OTelCollector started successfully";
}

OTelCollectorImp::~OTelCollectorImp()
{
    if (m_journal.info())
        m_journal.info() << "OTelCollector shutting down";
    if (m_provider)
    {
        // ForceFlush to export any pending metrics before shutdown.
        m_provider->ForceFlush();
        m_provider->Shutdown();
    }
    if (m_journal.info())
        m_journal.info() << "OTelCollector stopped";
}

Hook
OTelCollectorImp::make_hook(HookImpl::HandlerType const& handler)
{
    return Hook(std::make_shared<OTelHookImpl>(handler, shared_from_this()));
}

Counter
OTelCollectorImp::make_counter(std::string const& name)
{
    return Counter(std::make_shared<OTelCounterImpl>(formatName(name), m_otelMeter));
}

Event
OTelCollectorImp::make_event(std::string const& name)
{
    return Event(std::make_shared<OTelEventImpl>(formatName(name), m_otelMeter));
}

Gauge
OTelCollectorImp::make_gauge(std::string const& name)
{
    return Gauge(
        std::make_shared<OTelGaugeImpl>(formatName(name), m_otelMeter, shared_from_this()));
}

Meter
OTelCollectorImp::make_meter(std::string const& name)
{
    return Meter(std::make_shared<OTelMeterImpl>(formatName(name), m_otelMeter));
}

void
OTelCollectorImp::addHook(OTelHookImpl* hook)
{
    std::lock_guard lock(m_mutex);
    m_hooks.push_back(hook);
}

void
OTelCollectorImp::removeHook(OTelHookImpl* hook)
{
    std::lock_guard lock(m_mutex);
    m_hooks.erase(std::remove(m_hooks.begin(), m_hooks.end(), hook), m_hooks.end());
}

void
OTelCollectorImp::callHooks()
{
    // Debounce: hooks run at most once per 500ms. Multiple gauge callbacks
    // fire during the same collection cycle; only the first one triggers
    // hooks. Subsequent callbacks within the window read already-updated
    // gauge values.
    auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::steady_clock::now().time_since_epoch())
                   .count();
    auto last = m_lastHookCallMs.load(std::memory_order_relaxed);
    if (now - last < 500)
        return;
    if (!m_lastHookCallMs.compare_exchange_strong(last, now, std::memory_order_relaxed))
        return;  // Another thread won the race.

    std::lock_guard lock(m_mutex);
    for (auto* hook : m_hooks)
        hook->callHandler();
}

void
OTelCollectorImp::addGauge(OTelGaugeImpl* gauge)
{
    std::lock_guard lock(m_mutex);
    m_gauges.push_back(gauge);
}

void
OTelCollectorImp::removeGauge(OTelGaugeImpl* gauge)
{
    std::lock_guard lock(m_mutex);
    m_gauges.erase(std::remove(m_gauges.begin(), m_gauges.end(), gauge), m_gauges.end());
}

opentelemetry::nostd::shared_ptr<metrics_api::Meter> const&
OTelCollectorImp::otelMeter() const
{
    return m_otelMeter;
}

std::string
OTelCollectorImp::formatName(std::string const& name) const
{
    // StatsD uses "prefix.group.name" format. The OTel StatsD receiver
    // converts dots to underscores for Prometheus. We replicate this
    // to preserve metric name compatibility.
    //
    // Example: prefix="rippled", name="LedgerMaster.Validated_Ledger_Age"
    //          -> "rippled_LedgerMaster_Validated_Ledger_Age"
    std::string result;
    if (!m_prefix.empty())
    {
        result = m_prefix;
        result += '_';
    }
    for (char c : name)
    {
        result += (c == '.') ? '_' : c;
    }
    return result;
}

} // namespace detail

//------------------------------------------------------------------------------

std::shared_ptr<Collector>
OTelCollector::New(
    std::string const& endpoint,
    std::string const& prefix,
    std::string const& instanceId,
    Journal journal)
{
    return std::make_shared<detail::OTelCollectorImp>(endpoint, prefix, instanceId, journal);
}

} // namespace insight
} // namespace beast

#else // !XRPL_ENABLE_TELEMETRY

// When telemetry is disabled at compile time, OTelCollector::New()
// returns a NullCollector so callers do not need conditional logic.

#include <xrpl/beast/insight/NullCollector.h>
#include <xrpl/beast/insight/OTelCollector.h>

namespace beast {
namespace insight {

std::shared_ptr<Collector>
OTelCollector::New(
    std::string const& /* endpoint */,
    std::string const& /* prefix */,
    std::string const& /* instanceId */,
    Journal /* journal */)
{
    return NullCollector::New();
}

} // namespace insight
} // namespace beast

#endif // XRPL_ENABLE_TELEMETRY
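
`OTelGaugeImpl::increment()` keeps the gauge non-negative with a relaxed compare-exchange loop. The hypothetical `ClampedGauge` below (not a rippled class) isolates that pattern so it can be tested on its own. It also saturates at `INT64_MAX`, both in `increment()` and in `set()`. That is an assumption about the intended behaviour given the `int64_t` storage. The sketch further assumes an unsigned `value_type` and a signed `difference_type`, as the `beast::insight` gauge interface uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-alone gauge cell illustrating a clamped CAS update;
// it is a sketch, not the rippled OTelGaugeImpl class itself.
class ClampedGauge
{
public:
    using value_type = std::uint64_t;      // assumed unsigned absolute value
    using difference_type = std::int64_t;  // assumed signed delta type

    void
    set(value_type value)
    {
        // Values beyond the int64_t range saturate instead of wrapping negative.
        m_value.store(
            value > static_cast<value_type>(kMax) ? kMax : static_cast<std::int64_t>(value),
            std::memory_order_relaxed);
    }

    void
    increment(difference_type amount)
    {
        std::int64_t current = m_value.load(std::memory_order_relaxed);
        std::int64_t desired;
        do
        {
            desired = saturatingAdd(current, amount);
            // On failure, compare_exchange_weak reloads `current`, so the
            // loop retries with the value another thread just stored.
        } while (!m_value.compare_exchange_weak(current, desired, std::memory_order_relaxed));
    }

    std::int64_t
    get() const
    {
        return m_value.load(std::memory_order_relaxed);
    }

private:
    static constexpr std::int64_t kMax = std::numeric_limits<std::int64_t>::max();

    // `current` is always in [0, kMax], so only a positive amount can overflow;
    // a negative amount bottoms out at INT64_MIN, which is then clamped to 0.
    static std::int64_t
    saturatingAdd(std::int64_t current, std::int64_t amount)
    {
        if (amount > 0 && current > kMax - amount)
            return kMax;
        std::int64_t const sum = current + amount;
        return sum < 0 ? 0 : sum;
    }

    std::atomic<std::int64_t> m_value{0};
};
```

The overflow check happens before the addition, so the arithmetic never hits signed-overflow undefined behaviour. Relaxed ordering is enough here because the exported value is a single independent number with no other data published alongside it.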
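
`OTelCollectorImp::callHooks()` debounces hook execution. Every gauge callback in a collection cycle calls it, but only the caller that wins a compare-exchange on the last-run timestamp actually runs the hooks. The minimal sketch below shows that pattern with a hypothetical `Debouncer` class. It takes an injected millisecond value so it can be exercised without sleeping; the real code reads `std::chrono::steady_clock` and also holds `m_mutex` while running hooks, which the sketch omits. Unlike the original, the sketch treats "never run" as a separate state, so the very first call always fires even when the clock reads less than one window.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <utility>

// Hypothetical illustration of the callHooks() debounce; not rippled code.
class Debouncer
{
public:
    explicit Debouncer(std::chrono::milliseconds window) : m_windowMs(window.count())
    {
        assert(m_windowMs > 0);
    }

    // Runs `action` and returns true only if no other call has won within
    // the current window. `nowMs` stands in for a steady_clock reading.
    template <class Action>
    bool
    runIfDue(std::int64_t nowMs, Action&& action)
    {
        std::int64_t last = m_lastMs.load(std::memory_order_relaxed);
        if (last != kNever && nowMs - last < m_windowMs)
            return false;  // Still inside the window: skip.
        // Only one concurrent caller can swap `last` for `nowMs`.
        if (!m_lastMs.compare_exchange_strong(last, nowMs, std::memory_order_relaxed))
            return false;  // Another thread won the race.
        std::forward<Action>(action)();
        return true;
    }

private:
    static constexpr std::int64_t kNever = std::numeric_limits<std::int64_t>::min();

    std::int64_t m_windowMs;
    std::atomic<std::int64_t> m_lastMs{kNever};
};
```

Using a compare-exchange instead of a mutex keeps the fast path lock-free. Callers that lose the race return immediately instead of queueing behind the hook run.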
@@ -23,6 +23,24 @@ public:

            m_collector = beast::insight::StatsDCollector::New(address, prefix, journal);
        }
        // LCOV_EXCL_START -- OTel collector path is not exercised in unit tests
        else if (server == "otel")
        {
            // Read OTLP metrics endpoint from [insight] section.
            // Default to the standard OTLP/HTTP metrics path on localhost.
            std::string endpoint = get(params, "endpoint");
            if (endpoint.empty())
                endpoint = "http://localhost:4318/v1/metrics";
            std::string const& prefix(get(params, "prefix"));

            // Read service_instance_id, same key as the [telemetry]
            // section uses, so multi-node deployments can distinguish
            // metric sources via the exported_instance Prometheus label.
            std::string const instanceId = get(params, "service_instance_id");

            m_collector = beast::insight::OTelCollector::New(endpoint, prefix, instanceId, journal);
        }
        // LCOV_EXCL_STOP
        else
        {
            m_collector = beast::insight::NullCollector::New();
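
The `server == "otel"` branch above reads three keys from the `[insight]` section: `endpoint`, `prefix` and `service_instance_id`. It falls back to `http://localhost:4318/v1/metrics` when `endpoint` is missing or empty. The sketch below restates that resolution logic so it can be tested in isolation. A plain `std::map` stands in for rippled's configuration section and its `get()` helper; that substitution is an assumption made so the example compiles on its own. `OtelInsightSettings`, `getOrEmpty` and `resolveOtelSettings` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the [insight] config section. rippled reads the
// section through its own Section type and get() helper, not a std::map.
using InsightParams = std::map<std::string, std::string>;

// Hypothetical bundle of the three values passed to OTelCollector::New().
struct OtelInsightSettings
{
    std::string endpoint;
    std::string prefix;
    std::string instanceId;
};

// Returns the value for `key`, or an empty string when the key is absent.
inline std::string
getOrEmpty(InsightParams const& params, std::string const& key)
{
    auto const it = params.find(key);
    return it == params.end() ? std::string{} : it->second;
}

// Restates the server == "otel" branch: a missing or empty `endpoint`
// falls back to the local OTLP/HTTP metrics path.
inline OtelInsightSettings
resolveOtelSettings(InsightParams const& params)
{
    OtelInsightSettings settings;
    settings.endpoint = getOrEmpty(params, "endpoint");
    if (settings.endpoint.empty())
        settings.endpoint = "http://localhost:4318/v1/metrics";
    settings.prefix = getOrEmpty(params, "prefix");
    settings.instanceId = getOrEmpty(params, "service_instance_id");
    return settings;
}
```

In the hunk itself, the resolved values go straight to `beast::insight::OTelCollector::New(endpoint, prefix, instanceId, journal)`.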