# Phase 7: Native OTel Metrics Migration — Task List

> **Goal**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency.
>
> **Scope**: New `OTelCollectorImpl` class, `CollectorManager` config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.
>
> **Branch**: `pratik/otel-phase7-native-metrics` (from `pratik/otel-phase6-statsd`)

### Related Plan Documents

| Document                                                             | Relevance                                                                  |
| -------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| [02-design-decisions.md](./02-design-decisions.md)                   | Collector interface design, beast::insight coexistence strategy (replaced) |
| [05-configuration-reference.md](./05-configuration-reference.md)     | `[insight]` and `[telemetry]` config sections                              |
| [06-implementation-phases.md](./06-implementation-phases.md)         | Phase 7 summary, exit criteria, success metrics (§6.8)                     |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved                           |

---

## Motivation: Why Migrate from StatsD to Native OTel Metrics

### What We Gain

1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."

2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, a 1472-byte payload ceiling per datagram before IP fragmentation (on a standard 1500-byte MTU), and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.

3. **Fixes the `|m` wire format issue** — `StatsDMeterImpl` uses the non-standard `|m` StatsD type, which the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely and resolve the deferred Phase 6 Task 6.1.

4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.

5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML.

6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it, which is not possible with StatsD-sourced metrics.

7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown, all built into the SDK rather than hand-rolled in `StatsDCollectorImp`.

### What We Lose

1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback.

2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it adds moving parts.

3. **Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.

4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it is less battle-tested than the tracing SDK in the C++ ecosystem.

### Decision

The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period.

---

## Architecture

### Class Hierarchy (after Phase 7)

```
beast::insight::Collector (abstract interface — unchanged)
  |
  +-- StatsDCollector (existing — retained as fallback, deprecated)
  |     +-- StatsDCounterImpl    -> StatsD |c over UDP
  |     +-- StatsDGaugeImpl      -> StatsD |g over UDP
  |     +-- StatsDMeterImpl      -> StatsD |m over UDP (non-standard)
  |     +-- StatsDEventImpl      -> StatsD |ms over UDP
  |     +-- StatsDHookImpl       -> 1s periodic callback
  |
  +-- NullCollector (existing — unchanged, used when disabled)
  |     +-- NullCounterImpl      -> no-op
  |     +-- NullGaugeImpl        -> no-op
  |     +-- NullMeterImpl        -> no-op
  |     +-- NullEventImpl        -> no-op
  |     +-- NullHookImpl         -> no-op
  |
  +-- OTelCollector (NEW — Phase 7)
        +-- OTelCounterImpl      -> otel::Counter<int64_t>
        +-- OTelGaugeImpl        -> otel::ObservableGauge<uint64_t>
        +-- OTelMeterImpl        -> otel::Counter<uint64_t>
        +-- OTelEventImpl        -> otel::Histogram<double>
        +-- OTelHookImpl         -> 1s periodic callback (same pattern)
```

### Instrument Type Mapping

| beast::insight Type                     | OTel Metrics SDK Instrument                     | Rationale                                                                                                    |
| --------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| `Counter` (int64, delta, `\|c`)         | `Counter<int64_t>`                              | Direct 1:1 — both are monotonic delta counters                                                               |
| `Gauge` (uint64, current value, `\|g`)  | `ObservableGauge<uint64_t>` with async callback | OTel gauges use async observation via callbacks; the existing Hook pattern already provides periodic polling |
| `Meter` (uint64, increment-only, `\|m`) | `Counter<uint64_t>`                             | Meters are semantically counters — this fixes the non-standard `\|m` wire format issue from Phase 6 Task 6.1 |
| `Event` (ms duration, `\|ms`)           | `Histogram<double>` with explicit buckets       | Duration distributions — use same buckets as SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms   |
| `Hook` (1s periodic callback)           | `PeriodicMetricReader` callback alignment       | Collection interval matches existing 1s period; hooks fire gauge observations before export                  |
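
To make the `Event` row concrete, the sketch below shows how a millisecond duration would fall into the explicit bucket boundaries listed in the table. It is a standard-library model only and does not use the OTel SDK; `DurationHistogram` and `kBoundsMs` are hypothetical names introduced for illustration. It assumes the OTel convention that bucket boundaries are upper-inclusive, so ten boundaries produce eleven buckets.

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Bucket boundaries from the mapping table, in milliseconds.
constexpr std::array<double, 10> kBoundsMs{
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0};

// Hypothetical stand-in for an explicit-bucket histogram (not the SDK type).
class DurationHistogram
{
public:
    // Mirrors Event::notify(): convert the duration to double milliseconds
    // and count it in the first bucket whose upper bound is >= the value.
    void notify(std::chrono::milliseconds d)
    {
        double const ms = static_cast<double>(d.count());
        auto const it =
            std::lower_bound(kBoundsMs.begin(), kBoundsMs.end(), ms);
        ++counts_[static_cast<std::size_t>(it - kBoundsMs.begin())];
        sum_ += ms;
        ++count_;
    }

    std::uint64_t bucket(std::size_t i) const { return counts_.at(i); }
    std::uint64_t count() const { return count_; }
    double sum() const { return sum_; }

private:
    // One extra bucket holds everything above the last boundary.
    std::array<std::uint64_t, kBoundsMs.size() + 1> counts_{};
    std::uint64_t count_ = 0;
    double sum_ = 0.0;
};
```

With these boundaries, a 5 ms event is counted in the second bucket (greater than 1, up to and including 5), and anything above 5000 ms lands in the overflow bucket.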

### Data Flow (after Phase 7)

```mermaid
graph LR
    subgraph rippledNode["rippled Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN"]
        B["beast::insight<br/>OTelCollector"]
    end

    subgraph collector["OTel Collector :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
        BP["Batch Processor"]
        SM["SpanMetrics Connector"]

        R1 --> BP
        BP --> SM
    end

    subgraph backends["Trace Backends"]
        D["Jaeger / Tempo"]
    end

    subgraph metrics["Metrics Stack"]
        E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
    end

    subgraph viz["Visualization"]
        F["Grafana :3000"]
    end

    A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
    B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1

    BP -->|"OTLP/gRPC"| D
    SM -->|"RED metrics"| E
    R1 -->|"rippled_* metrics<br/>(native OTLP)"| E

    E --> F
    D --> F

    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
    style B fill:#d9534f,color:#fff,stroke:#b52d2d
    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```

**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.

### Configuration

```ini
# [insight] section — new "otel" server option
[insight]
server=otel          # NEW: uses OTel OTLP metrics exporter
prefix=rippled       # metric name prefix (preserved)

# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```

The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed.
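
The path substitution described above could look roughly like the following. This is a minimal sketch using only the standard library; `metricsEndpointFrom` is a hypothetical helper name, not an existing rippled function, and it assumes the only rewrite needed is swapping a trailing `/v1/traces` for `/v1/metrics`.

```cpp
#include <string>

// Hypothetical helper: derive the metrics endpoint from the [telemetry]
// endpoint by replacing a trailing "/v1/traces" with "/v1/metrics".
// Endpoints that do not end in "/v1/traces" are returned unchanged.
std::string
metricsEndpointFrom(std::string const& telemetryEndpoint)
{
    static std::string const tracesPath = "/v1/traces";
    static std::string const metricsPath = "/v1/metrics";

    std::string result = telemetryEndpoint;
    if (result.size() >= tracesPath.size() &&
        result.compare(
            result.size() - tracesPath.size(),
            tracesPath.size(),
            tracesPath) == 0)
    {
        result.replace(
            result.size() - tracesPath.size(), tracesPath.size(), metricsPath);
    }
    return result;
}
```

Matching only the suffix keeps the helper from touching a host name or path segment that happens to contain the same text, and it leaves an endpoint that already points at `/v1/metrics` untouched.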

**Backward compatibility**: `server=statsd` continues to work exactly as before.

---

## Task 7.1: Add OTel Metrics SDK to Build Dependencies

**Objective**: Enable the OTel C++ Metrics SDK components in the build system.

**What to do**:

- Edit `conanfile.py`:
  - Add OTel metrics SDK components to the dependency list when `telemetry=True`
  - Components needed: `opentelemetry-cpp::metrics`, `opentelemetry-cpp::otlp_http_metric_exporter`

- Edit `CMakeLists.txt` (telemetry section):
  - Link `opentelemetry::metrics` and `opentelemetry::otlp_http_metric_exporter` targets

**Key modified files**:

- `conanfile.py`
- `CMakeLists.txt` (or the relevant telemetry cmake target)

**Reference**: [05-configuration-reference.md §5.3](./05-configuration-reference.md) — CMake integration

---

## Task 7.2: Implement OTelCollector Class

**Objective**: Create the core `OTelCollector` implementation that maps beast::insight instruments to OTel Metrics SDK instruments.

**What to do**:

- Create `include/xrpl/beast/insight/OTelCollector.h`:
  - Public factory: `static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)`
  - Derives from `StatsDCollector` (or directly from `Collector` — TBD based on shared code)

- Create `src/libxrpl/beast/insight/OTelCollector.cpp` (~400-500 lines):
  - **OTelCounterImpl**: Wraps `opentelemetry::metrics::Counter<int64_t>`. `increment(amount)` calls `counter->Add(amount)`.
  - **OTelGaugeImpl**: Uses `opentelemetry::metrics::ObservableGauge<uint64_t>` with an async callback. `set(value)` stores the value atomically; the callback reads it during collection.
  - **OTelMeterImpl**: Wraps `opentelemetry::metrics::Counter<uint64_t>`. `increment(amount)` calls `counter->Add(amount)`. Semantically identical to Counter but unsigned.
  - **OTelEventImpl**: Wraps `opentelemetry::metrics::Histogram<double>`. `notify(duration)` calls `histogram->Record(duration.count())`. Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
  - **OTelHookImpl**: Stores the handler function, which is called during periodic metric collection (same 1s pattern via PeriodicMetricReader).
  - **OTelCollectorImp**: Main class.
    - Creates `MeterProvider` with `PeriodicMetricReader` (1s export interval)
    - Creates `OtlpHttpMetricExporter` pointing to `[telemetry]` endpoint
    - Sets resource attributes (service.name, service.instance.id) matching trace exporter
    - Implements all `make_*()` factory methods
    - Prefixes metric names with `[insight] prefix=` value

- Guard all OTel SDK includes with `#ifdef XRPL_ENABLE_TELEMETRY` so the code compiles to `NullCollector` equivalents when telemetry is disabled.
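
The gauge and hook behaviour described in the list above can be sketched without the OTel SDK. In this standard-library model, `GaugeCell` and `CollectionCycle` are hypothetical names standing in for `OTelGaugeImpl` and one collection pass of the periodic reader; the real classes would register an SDK callback instead of being polled directly.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for OTelGaugeImpl: set() only stores the latest
// value; nothing is exported until a collection pass reads it.
class GaugeCell
{
public:
    void set(std::uint64_t value) noexcept
    {
        value_.store(value, std::memory_order_relaxed);
    }

    std::uint64_t read() const noexcept
    {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};

struct Observation
{
    std::string name;
    std::uint64_t value;
};

// Hypothetical model of one collection pass: hooks run first so that
// gauges are refreshed, then every registered gauge is observed.
class CollectionCycle
{
public:
    void addHook(std::function<void()> hook)
    {
        hooks_.push_back(std::move(hook));
    }

    // The gauge must outlive this object; only a pointer is kept.
    void addGauge(std::string name, GaugeCell const& cell)
    {
        gauges_.push_back({std::move(name), &cell});
    }

    std::vector<Observation> collect() const
    {
        for (auto const& hook : hooks_)
            hook();

        std::vector<Observation> out;
        out.reserve(gauges_.size());
        for (auto const& g : gauges_)
            out.push_back({g.name, g.cell->read()});
        return out;
    }

private:
    struct Entry
    {
        std::string name;
        GaugeCell const* cell;
    };

    std::vector<std::function<void()>> hooks_;
    std::vector<Entry> gauges_;
};
```

Because `set()` is a single atomic store, application threads never block on the exporter, and each collection pass sees only the most recent value, which matches gauge semantics.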

**Key new files**:

- `include/xrpl/beast/insight/OTelCollector.h`
- `src/libxrpl/beast/insight/OTelCollector.cpp`

**Key patterns to follow**:

- Match `StatsDCollector.cpp` structure: private impl classes, intrusive list for metrics, strand-based thread safety
- Match existing telemetry code style from `src/libxrpl/telemetry/Telemetry.cpp`
- Use RAII for MeterProvider lifecycle (shutdown on destructor)

**Reference**: [04-code-samples.md](./04-code-samples.md) — code style and patterns

---

## Task 7.3: Update CollectorManager

**Objective**: Add `server=otel` config option to route metric creation to the new OTel backend.

**What to do**:

- Edit `src/xrpld/app/main/CollectorManager.cpp`:
  - In the constructor, add a third branch after `server == "statsd"`:
    ```cpp
    else if (server == "otel")
    {
        // Read endpoint from [telemetry] section
        auto const endpoint = get(telemetryParams, "endpoint",
                                  "http://localhost:4318/v1/metrics");
        std::string const& prefix(get(params, "prefix"));
        m_collector = beast::insight::OTelCollector::New(
            endpoint, prefix, journal);
    }
    ```
  - This requires access to the `[telemetry]` config section — may need to pass it as a parameter or read from Application config.

- Edit `src/xrpld/app/main/CollectorManager.h`:
  - Add `#include <xrpl/beast/insight/OTelCollector.h>`

**Key modified files**:

- `src/xrpld/app/main/CollectorManager.cpp`
- `src/xrpld/app/main/CollectorManager.h`

---

## Task 7.4: Update OTel Collector Configuration

**Objective**: Add the OTLP receiver to the Collector's metrics pipeline and remove the StatsD receiver dependency.

**What to do**:

- Edit `docker/telemetry/otel-collector-config.yaml`:
  - Remove `statsd` receiver (no longer needed when `server=otel`)
  - Add metrics pipeline under `service.pipelines`:
    ```yaml
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]
    ```
  - The OTLP receiver already listens on :4318, so it only needs to be listed among the metrics pipeline receivers.
  - Keep `spanmetrics` connector in the metrics pipeline so span-derived RED metrics continue working.

- Edit `docker/telemetry/docker-compose.yml`:
  - Remove UDP :8125 port mapping from otel-collector service
  - Update rippled service config: change `[insight] server=statsd` to `server=otel`

**Key modified files**:

- `docker/telemetry/otel-collector-config.yaml`
- `docker/telemetry/docker-compose.yml`

**Note**: Keep a commented-out `statsd` receiver block for operators who need backward compatibility.

---

## Task 7.5: Preserve Metric Names in Prometheus

**Objective**: Ensure existing Grafana dashboards continue working with identical metric names.

**What to do**:

- In `OTelCollector.cpp`, construct OTel instrument names to match existing Prometheus metric names:
  - beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `rippled_LedgerMaster_Validated_Ledger_Age`
  - The prefix + group + name concatenation must produce the same string as `StatsDCollector`'s format
  - Use underscores as separators (matching StatsD convention)

- Verify in integration test that key Prometheus queries still return data:
  - `rippled_LedgerMaster_Validated_Ledger_Age`
  - `rippled_Peer_Finder_Active_Inbound_Peers`
  - `rippled_rpc_requests`

**Key consideration**: The Collector's Prometheus exporter may normalize metric names differently than the StatsD receiver path did. Test this early (during Task 7.2) and adjust the naming strategy if needed. In particular, the Prometheus exporter adds a `_total` suffix to counters and converts dots to underscores, so instrument names must be chosen so that the exported names still match the existing conventions.
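
As an illustration of the naming rules above, the sketch below joins prefix, group, and name with underscores and then predicts the name a dashboard query would see. Both functions are hypothetical helpers written for this example, not existing rippled or OTel APIs, and the `_total` handling is an assumption about the exporter's default behaviour that should be confirmed against the running Collector.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: build the OTel instrument name from the [insight]
// prefix, the beast::insight group, and the metric name. Empty parts are
// skipped and dots are mapped to underscores.
std::string
otelInstrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string out;
    for (std::string const* part : {&prefix, &group, &name})
    {
        if (part->empty())
            continue;
        if (!out.empty())
            out += '_';
        out += *part;
    }
    for (char& c : out)
    {
        if (c == '.')
            c = '_';
    }
    return out;
}

// Hypothetical helper: the name expected in Prometheus, ASSUMING the
// exporter appends "_total" to monotonic counters that lack it.
std::string
expectedPrometheusName(std::string const& instrument, bool monotonicCounter)
{
    static std::string const suffix = "_total";
    bool const hasSuffix = instrument.size() >= suffix.size() &&
        instrument.compare(
            instrument.size() - suffix.size(), suffix.size(), suffix) == 0;
    return (monotonicCounter && !hasSuffix) ? instrument + suffix : instrument;
}
```

For example, `otelInstrumentName("rippled", "LedgerMaster", "Validated_Ledger_Age")` yields `rippled_LedgerMaster_Validated_Ledger_Age`, while a counter such as `rippled_rpc_requests` would be expected to surface as `rippled_rpc_requests_total` under the stated assumption, which is the case Task 7.6 has to handle.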

---

## Task 7.6: Update Grafana Dashboards

**Objective**: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.

**What to do**:

- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed.
- If OTLP export produces different names (e.g., `_total` suffix on counters), update:
  - `docker/telemetry/grafana/dashboards/statsd-node-health.json`
  - `docker/telemetry/grafana/dashboards/statsd-network-traffic.json`
  - `docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json`
- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).

**Key modified files**:

- `docker/telemetry/grafana/dashboards/statsd-*.json` (3 files, conditionally)

---

## Task 7.7: Update Integration Tests

**Objective**: Verify the full OTLP metrics pipeline end-to-end.

**What to do**:

- Edit `docker/telemetry/integration-test.sh`:
  - Update test config to use `[insight] server=otel`
  - Verify metrics arrive in Prometheus via OTLP (not StatsD)
  - Add check that StatsD receiver is no longer required
  - Preserve all existing metric presence checks

**Key modified files**:

- `docker/telemetry/integration-test.sh`

---

## Task 7.8: Update Documentation

**Objective**: Update all plan docs, runbook, and reference docs to reflect the migration.

**What to do**:

- Edit `docs/telemetry-runbook.md`:
  - Update `[insight]` config examples to show `server=otel`
  - Update troubleshooting section (no more StatsD UDP debugging)

- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Update Data Flow Overview diagram (remove StatsD receiver)
  - Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
  - Update config examples

- Edit `OpenTelemetryPlan/05-configuration-reference.md`:
  - Add `server=otel` option to `[insight]` section docs

- Edit `docker/telemetry/TESTING.md`:
  - Update setup instructions to use `server=otel`

**Key modified files**:

- `docs/telemetry-runbook.md`
- `OpenTelemetryPlan/09-data-collection-reference.md`
- `OpenTelemetryPlan/05-configuration-reference.md`
- `docker/telemetry/TESTING.md`

---

## Summary Table

| Task | Description                            | New Files | Modified Files | Effort | Risk   | Depends On |
| ---- | -------------------------------------- | --------- | -------------- | ------ | ------ | ---------- |
| 7.1  | Add OTel Metrics SDK to build deps     | 0         | 2              | 0.5d   | Low    | —          |
| 7.2  | Implement OTelCollector class          | 2         | 0              | 3d     | Medium | 7.1        |
| 7.3  | Update CollectorManager config routing | 0         | 2              | 0.5d   | Low    | 7.2        |
| 7.4  | Update OTel Collector YAML and Docker  | 0         | 2              | 0.5d   | Low    | 7.3        |
| 7.5  | Preserve metric names in Prometheus    | 0         | 1              | 1d     | Medium | 7.2        |
| 7.6  | Update Grafana dashboards (if needed)  | 0         | 3              | 1d     | Low    | 7.5        |
| 7.7  | Update integration tests               | 0         | 1              | 0.5d   | Low    | 7.4        |
| 7.8  | Update documentation                   | 0         | 4              | 1d     | Low    | 7.6        |

**Total Effort**: 8 days

**Parallel work**: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6.

**Exit Criteria** (from [06-implementation-phases.md §6.8](./06-implementation-phases.md)):

- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- [ ] `server=otel` is the default in development docker-compose
- [ ] `server=statsd` still works as a fallback
- [ ] Existing Grafana dashboards display data correctly
- [ ] Integration test passes with OTLP-only metrics pipeline
- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead)
- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant — Meter mapped to OTel Counter