Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl behind the existing beast::insight::Collector interface. Maps Counter, Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to same collector endpoint as traces. Eliminates StatsD UDP dependency. Resolves deferred Phase 6 Task 6.1 (|m wire format).
Phase 8 (log correlation): Inject trace_id/span_id into JLOG output via Logs::format() thread-local span context read. Add Grafana Loki with OTel Collector filelog receiver for centralized log ingestion. Enable bidirectional Tempo-Loki correlation in Grafana.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 7: Native OTel Metrics Migration — Task List
Goal: Replace StatsDCollector with a native OpenTelemetry Metrics SDK implementation behind the existing beast::insight::Collector interface, eliminating the StatsD UDP dependency.
Scope: New OTelCollectorImpl class, CollectorManager config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.
Branch: pratik/otel-phase7-native-metrics (from pratik/otel-phase6-statsd)
Related Plan Documents
| Document | Relevance |
|---|---|
| 02-design-decisions.md | Collector interface design, beast::insight coexistence strategy (replaced) |
| 05-configuration-reference.md | [insight] and [telemetry] config sections |
| 06-implementation-phases.md | Phase 7 summary, exit criteria, success metrics (§6.8) |
| 09-data-collection-reference.md | Complete metric inventory that must be preserved |
Motivation: Why Migrate from StatsD to Native OTel Metrics
What We Gain
- Unified telemetry pipeline: Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. This eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."
- Eliminates StatsD UDP limitations: StatsD is fire-and-forget over UDP, with no delivery guarantees, no backpressure, a practical payload limit of 1472 bytes per datagram on a 1500-byte MTU (larger packets fragment), and text-based encoding overhead. OTLP runs over HTTP or gRPC with retries, binary protobuf encoding, and connection-level flow control.
- Fixes the |m wire format issue: StatsDMeterImpl uses the non-standard |m StatsD type, which the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1, previously DEFERRED, becomes resolved).
- Richer metric semantics: The OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.
- Removes infrastructure dependency: No StatsD receiver is needed in the OTel Collector. One less receiver to configure, monitor, and debug, and a simpler collector YAML.
- Metric-to-trace correlation: OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it, which is not possible with StatsD-sourced metrics.
- Production-grade export: OTel's PeriodicMetricReader provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown, all built into the SDK rather than hand-rolled in StatsDCollectorImp.
What We Lose
- StatsD ecosystem compatibility: Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep server=statsd as a fallback.
- Simplicity of UDP: StatsD's UDP fire-and-forget model is very simple and needs no connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it adds moving parts.
- Slightly higher memory: The OTel SDK maintains internal aggregation state for metrics before export, whereas StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.
- Dependency on OTel C++ Metrics SDK stability: The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it is less battle-tested than the tracing SDK in the C++ ecosystem.
Decision
The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. StatsDCollector is retained as a fallback via server=statsd for operators who need StatsD ecosystem compatibility during the transition period.
Architecture
Class Hierarchy (after Phase 7)
beast::insight::Collector (abstract interface — unchanged)
|
+-- StatsDCollector (existing — retained as fallback, deprecated)
| +-- StatsDCounterImpl -> StatsD |c over UDP
| +-- StatsDGaugeImpl -> StatsD |g over UDP
| +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard)
| +-- StatsDEventImpl -> StatsD |ms over UDP
| +-- StatsDHookImpl -> 1s periodic callback
|
+-- NullCollector (existing — unchanged, used when disabled)
| +-- NullCounterImpl -> no-op
| +-- NullGaugeImpl -> no-op
| +-- NullMeterImpl -> no-op
| +-- NullEventImpl -> no-op
| +-- NullHookImpl -> no-op
|
+-- OTelCollector (NEW — Phase 7)
+-- OTelCounterImpl -> otel::Counter<int64_t>
+-- OTelGaugeImpl -> otel::ObservableGauge<uint64_t>
+-- OTelMeterImpl -> otel::Counter<uint64_t>
+-- OTelEventImpl -> otel::Histogram<double>
+-- OTelHookImpl -> 1s periodic callback (same pattern)
Instrument Type Mapping
| beast::insight Type | OTel Metrics SDK Instrument | Rationale |
|---|---|---|
| Counter (int64, delta, \|c) | Counter<int64_t> | Direct 1:1: both are monotonic delta counters |
| Gauge (uint64, current value, \|g) | ObservableGauge<uint64_t> with async callback | OTel gauges use async observation via callbacks; the existing Hook pattern already provides periodic polling |
| Meter (uint64, increment-only, \|m) | Counter<uint64_t> | Meters are semantically counters; this fixes the non-standard \|m wire format issue from Phase 6 Task 6.1 |
| Event (ms duration, \|ms) | Histogram<double> with explicit buckets | Duration distributions; use the same buckets as SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms |
| Hook (1s periodic callback) | PeriodicMetricReader callback alignment | Collection interval matches the existing 1s period; hooks fire gauge observations before export |
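To make the explicit-bucket mapping for Event concrete, the following minimal sketch shows how a duration in milliseconds would fall into one of the buckets listed in the table. It assumes the usual explicit-bucket convention in which each boundary is an inclusive upper bound. The names `kBucketBoundsMs` and `bucketIndex` are hypothetical and exist only for illustration; in the real implementation the OTel SDK performs this aggregation internally.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Bucket upper bounds in milliseconds, copied from the mapping table.
constexpr std::array<double, 10> kBucketBoundsMs{
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0};

// Hypothetical helper (not part of beast::insight or the OTel SDK).
// Bucket i covers (bounds[i-1], bounds[i]]; the index equal to
// kBucketBoundsMs.size() is the overflow bucket for values above 5000 ms.
inline std::size_t bucketIndex(double durationMs)
{
    // lower_bound finds the first boundary >= durationMs, which is the
    // inclusive upper bound of the bucket the value belongs to.
    auto const it = std::lower_bound(
        kBucketBoundsMs.begin(), kBucketBoundsMs.end(), durationMs);
    return static_cast<std::size_t>(it - kBucketBoundsMs.begin());
}
```

Under these assumptions a 7 ms event lands in the (5, 10] bucket, and anything slower than 5000 ms lands in the overflow bucket.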
Data Flow (after Phase 7)
graph LR
subgraph rippledNode["rippled Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN"]
B["beast::insight<br/>OTelCollector"]
end
subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
BP["Batch Processor"]
SM["SpanMetrics Connector"]
R1 --> BP
BP --> SM
end
subgraph backends["Trace Backends"]
D["Jaeger / Tempo"]
end
subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
end
subgraph viz["Visualization"]
F["Grafana :3000"]
end
A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1
BP -->|"OTLP/gRPC"| D
SM -->|"RED metrics"| E
R1 -->|"rippled_* metrics<br/>(native OTLP)"| E
E --> F
D --> F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#d9534f,color:#fff,stroke:#b52d2d
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
Key change: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.
Configuration
# [insight] section — new "otel" server option
[insight]
server=otel # NEW: uses OTel OTLP metrics exporter
prefix=rippled # metric name prefix (preserved)
# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
The OTelCollector reads the OTLP endpoint from [telemetry] config (replacing /v1/traces with /v1/metrics for the metrics exporter). No additional config keys needed.
Backward compatibility: server=statsd continues to work exactly as before.
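The endpoint rewrite described above (replacing /v1/traces with /v1/metrics) could look roughly like the sketch below. The function name `deriveMetricsEndpoint` is hypothetical, and the fallback of returning the input unchanged when it does not end in /v1/traces is an assumption, not something this plan specifies.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive the OTLP metrics URL from the traces URL
// configured in [telemetry] endpoint=...
inline std::string deriveMetricsEndpoint(std::string const& tracesEndpoint)
{
    static std::string const tracesSuffix = "/v1/traces";
    static std::string const metricsSuffix = "/v1/metrics";

    auto const n = tracesEndpoint.size();
    bool const endsWithTraces = n >= tracesSuffix.size() &&
        tracesEndpoint.compare(
            n - tracesSuffix.size(), tracesSuffix.size(), tracesSuffix) == 0;

    if (endsWithTraces)
        return tracesEndpoint.substr(0, n - tracesSuffix.size()) + metricsSuffix;

    // Assumption: any other endpoint is passed through untouched.
    return tracesEndpoint;
}
```

With the example config above, `http://localhost:4318/v1/traces` would become `http://localhost:4318/v1/metrics`.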
Task 7.1: Add OTel Metrics SDK to Build Dependencies
Objective: Enable the OTel C++ Metrics SDK components in the build system.
What to do:
- Edit conanfile.py:
  - Add OTel metrics SDK components to the dependency list when telemetry=True
  - Components needed: opentelemetry-cpp::metrics, opentelemetry-cpp::otlp_http_metric_exporter
- Edit CMakeLists.txt (telemetry section):
  - Link opentelemetry::metrics and opentelemetry::otlp_http_metric_exporter targets
Key modified files:
- conanfile.py
- CMakeLists.txt (or the relevant telemetry cmake target)
Reference: 05-configuration-reference.md §5.3 — CMake integration
Task 7.2: Implement OTelCollector Class
Objective: Create the core OTelCollector implementation that maps beast::insight instruments to OTel Metrics SDK instruments.
What to do:
- Create include/xrpl/beast/insight/OTelCollector.h:
  - Public factory: static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)
  - Derives from StatsDCollector (or directly from Collector; TBD based on shared code)
- Create src/libxrpl/beast/insight/OTelCollector.cpp (~400-500 lines):
  - OTelCounterImpl: Wraps opentelemetry::metrics::Counter<int64_t>. increment(amount) calls counter->Add(amount).
  - OTelGaugeImpl: Uses opentelemetry::metrics::ObservableGauge<uint64_t> with an async callback. set(value) stores the value atomically; the callback reads it during collection.
  - OTelMeterImpl: Wraps opentelemetry::metrics::Counter<uint64_t>. increment(amount) calls counter->Add(amount). Semantically identical to Counter but unsigned.
  - OTelEventImpl: Wraps opentelemetry::metrics::Histogram<double>. notify(duration) calls histogram->Record(duration.count()). Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
  - OTelHookImpl: Stores the handler function. Called during periodic metric collection (same 1s pattern via PeriodicMetricReader).
  - OTelCollectorImp: Main class.
    - Creates MeterProvider with PeriodicMetricReader (1s export interval)
    - Creates OtlpHttpMetricExporter pointing to the [telemetry] endpoint
    - Sets resource attributes (service.name, service.instance.id) matching the trace exporter
    - Implements all make_*() factory methods
    - Prefixes metric names with the [insight] prefix= value
- Guard all OTel SDK includes with #ifdef XRPL_ENABLE_TELEMETRY to compile to NullCollector equivalents when telemetry is disabled.
Key new files:
- include/xrpl/beast/insight/OTelCollector.h
- src/libxrpl/beast/insight/OTelCollector.cpp
Key patterns to follow:
- Match StatsDCollector.cpp structure: private impl classes, intrusive list for metrics, strand-based thread safety
- Match existing telemetry code style from src/libxrpl/telemetry/Telemetry.cpp
- Use RAII for MeterProvider lifecycle (shutdown on destructor)
Reference: 04-code-samples.md — code style and patterns
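The gauge and hook behaviour described in this task (set() stores a value atomically, hooks fire before the value is observed at collection time) can be sketched without any OTel SDK types. `GaugeState` and `collectOnce` are hypothetical stand-ins for illustration only; the real OTelGaugeImpl would register its read as an observable-gauge callback with the SDK, and the relaxed memory ordering is an assumption about what is sufficient for a single independent value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the state behind OTelGaugeImpl.
class GaugeState
{
public:
    // Called from application threads whenever the gauge changes.
    void set(std::uint64_t value) noexcept
    {
        value_.store(value, std::memory_order_relaxed);
    }

    // Called from the collection callback to read the latest value.
    std::uint64_t observe() const noexcept
    {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};

// One simulated collection cycle: run every hook first so gauges are
// refreshed, then read the gauge as the exporter callback would.
inline std::uint64_t collectOnce(
    std::vector<std::function<void()>> const& hooks,
    GaugeState const& gauge)
{
    for (auto const& hook : hooks)
        hook();
    return gauge.observe();
}
```

Keeping the stored value separate from the SDK instrument means application threads never block on the exporter; they only perform one atomic store.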
Task 7.3: Update CollectorManager
Objective: Add server=otel config option to route metric creation to the new OTel backend.
What to do:
- Edit src/xrpld/app/main/CollectorManager.cpp:
  - In the constructor, add a third branch after server == "statsd":

        else if (server == "otel")
        {
            // Read endpoint from [telemetry] section
            auto const endpoint = get(
                telemetryParams, "endpoint", "http://localhost:4318/v1/metrics");
            std::string const& prefix(get(params, "prefix"));
            m_collector = beast::insight::OTelCollector::New(
                endpoint, prefix, journal);
        }

  - This requires access to the [telemetry] config section; it may need to be passed as a parameter or read from the Application config.
- Edit src/xrpld/app/main/CollectorManager.h:
  - Add #include <xrpl/beast/insight/OTelCollector.h>
Key modified files:
- src/xrpld/app/main/CollectorManager.cpp
- src/xrpld/app/main/CollectorManager.h
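The three-way routing that this task adds can be summarised in isolation as follows. This is only a sketch: `InsightBackend` and `selectBackend` are hypothetical names, and the fall-through to the null backend for any unrecognised value is an assumption based on NullCollector being the collector used when metrics are disabled.

```cpp
#include <cassert>
#include <string>

// Hypothetical enumeration of the collectors CollectorManager can create.
enum class InsightBackend { Null, StatsD, OTel };

// Map the [insight] server= value to a backend.
inline InsightBackend selectBackend(std::string const& server)
{
    if (server == "statsd")
        return InsightBackend::StatsD;  // existing fallback path
    if (server == "otel")
        return InsightBackend::OTel;    // new in Phase 7
    // Assumption: anything else (including an empty value) disables metrics.
    return InsightBackend::Null;
}
```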
Task 7.4: Update OTel Collector Configuration
Objective: Add a metrics pipeline to the OTLP receiver and remove the StatsD receiver dependency.
What to do:
- Edit docker/telemetry/otel-collector-config.yaml:
  - Remove the statsd receiver (no longer needed when server=otel)
  - Add a metrics pipeline under service.pipelines:

        metrics:
          receivers: [otlp, spanmetrics]
          processors: [batch]
          exporters: [prometheus]

  - The OTLP receiver already listens on :4318; it just needs to be added to the metrics pipeline receivers.
  - Keep the spanmetrics connector in the metrics pipeline so span-derived RED metrics continue working.
- Edit docker/telemetry/docker-compose.yml:
  - Remove the UDP :8125 port mapping from the otel-collector service
  - Update the rippled service config: change [insight] server=statsd to server=otel
Key modified files:
- docker/telemetry/otel-collector-config.yaml
- docker/telemetry/docker-compose.yml
Note: Keep a commented-out statsd receiver block for operators who need backward compatibility.
Task 7.5: Preserve Metric Names in Prometheus
Objective: Ensure existing Grafana dashboards continue working with identical metric names.
What to do:
- In OTelCollector.cpp, construct OTel instrument names to match existing Prometheus metric names:
  - beast::insight make_gauge("LedgerMaster", "Validated_Ledger_Age") → OTel instrument name: rippled_LedgerMaster_Validated_Ledger_Age
  - The prefix + group + name concatenation must produce the same string as StatsDCollector's format
  - Use underscores as separators (matching StatsD convention)
- Verify in the integration test that key Prometheus queries still return data:
  - rippled_LedgerMaster_Validated_Ledger_Age
  - rippled_Peer_Finder_Active_Inbound_Peers
  - rippled_rpc_requests
Key consideration: The Prometheus exporter in the Collector's metrics pipeline may normalize metric names differently than the StatsD receiver path did. Test this early (during Task 7.2) and adjust the naming strategy if needed. In particular, the exporter adds a _total suffix to counters and converts dots to underscores, so instrument names must be chosen so that the final Prometheus names match the existing conventions.
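The naming rules in this task can be sketched as two small functions. Both are illustrative assumptions rather than the actual implementation: `makeInstrumentName` joins prefix, group, and name with underscores as described above, and `toPrometheusName` models only the two normalizations this section mentions (dots to underscores, _total on counters). The real exporter's behaviour may differ by version and configuration, which is why the integration test remains the authority.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <string>

// Hypothetical helper: join the non-empty parts with underscores.
inline std::string makeInstrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string out;
    for (auto const* part : {&prefix, &group, &name})
    {
        if (part->empty())
            continue;
        if (!out.empty())
            out += '_';
        out += *part;
    }
    return out;
}

// Simplified model of the normalization described in this section.
inline std::string toPrometheusName(std::string name, bool isCounter)
{
    std::replace(name.begin(), name.end(), '.', '_');

    static std::string const suffix = "_total";
    bool const hasSuffix = name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;

    // Counters gain _total unless the name already ends with it.
    if (isCounter && !hasSuffix)
        name += suffix;
    return name;
}
```

Under this model a gauge keeps its name unchanged, while a counter such as rippled_rpc_requests would surface as rippled_rpc_requests_total, which is the kind of drift Task 7.6 has to absorb.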
Task 7.6: Update Grafana Dashboards
Objective: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.
What to do:
- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed.
- If OTLP export produces different names (e.g., _total suffix on counters), update:
  - docker/telemetry/grafana/dashboards/statsd-node-health.json
  - docker/telemetry/grafana/dashboards/statsd-network-traffic.json
  - docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).
Key modified files:
- docker/telemetry/grafana/dashboards/statsd-*.json (3 files, conditionally)
Task 7.7: Update Integration Tests
Objective: Verify the full OTLP metrics pipeline end-to-end.
What to do:
- Edit docker/telemetry/integration-test.sh:
  - Update test config to use [insight] server=otel
  - Verify metrics arrive in Prometheus via OTLP (not StatsD)
  - Add a check that the StatsD receiver is no longer required
  - Preserve all existing metric presence checks
Key modified files:
- docker/telemetry/integration-test.sh
Task 7.8: Update Documentation
Objective: Update all plan docs, runbook, and reference docs to reflect the migration.
What to do:
- Edit docs/telemetry-runbook.md:
  - Update [insight] config examples to show server=otel
  - Update troubleshooting section (no more StatsD UDP debugging)
- Edit OpenTelemetryPlan/09-data-collection-reference.md:
  - Update Data Flow Overview diagram (remove StatsD receiver)
  - Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
  - Update config examples
- Edit OpenTelemetryPlan/05-configuration-reference.md:
  - Add server=otel option to [insight] section docs
- Edit docker/telemetry/TESTING.md:
  - Update setup instructions to use server=otel
Key modified files:
- docs/telemetry-runbook.md
- OpenTelemetryPlan/09-data-collection-reference.md
- OpenTelemetryPlan/05-configuration-reference.md
- docker/telemetry/TESTING.md
Summary Table
| Task | Description | New Files | Modified Files | Effort | Risk | Depends On |
|---|---|---|---|---|---|---|
| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | 0.5d | Low | — |
| 7.2 | Implement OTelCollector class | 2 | 0 | 3d | Medium | 7.1 |
| 7.3 | Update CollectorManager config routing | 0 | 2 | 0.5d | Low | 7.2 |
| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 0.5d | Low | 7.3 |
| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 1d | Medium | 7.2 |
| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 1d | Low | 7.5 |
| 7.7 | Update integration tests | 0 | 1 | 0.5d | Low | 7.4 |
| 7.8 | Update documentation | 0 | 4 | 1d | Low | 7.6 |
Total Effort: 8 days
Parallel work: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6.
Exit Criteria (from 06-implementation-phases.md §6.8):
- All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- server=otel is the default in development docker-compose
- server=statsd still works as a fallback
- Existing Grafana dashboards display data correctly
- Integration test passes with OTLP-only metrics pipeline
- No performance regression vs StatsD baseline (< 1% CPU overhead)
- Deferred Task 6.1 (|m wire format) no longer relevant: Meter mapped to OTel Counter