rippled/OpenTelemetryPlan/Phase7_taskList.md
Pratik Mankawde 85a2220312 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00

Phase 7: Native OTel Metrics Migration — Task List

Goal: Replace StatsDCollector with a native OpenTelemetry Metrics SDK implementation behind the existing beast::insight::Collector interface, eliminating the StatsD UDP dependency.

Scope: New OTelCollectorImpl class, CollectorManager config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests.

Branch: pratik/otel-phase7-native-metrics (from pratik/otel-phase6-statsd)

| Document | Relevance |
| --- | --- |
| 02-design-decisions.md | Collector interface design, beast::insight coexistence strategy (replaced) |
| 05-configuration-reference.md | [insight] and [telemetry] config sections |
| 06-implementation-phases.md | Phase 7 summary, exit criteria, success metrics (§6.8) |
| 09-data-collection-reference.md | Complete metric inventory that must be preserved |

Motivation: Why Migrate from StatsD to Native OTel Metrics

What We Gain

  1. Unified telemetry pipeline — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."

  2. Eliminates StatsD UDP limitations: StatsD is fire-and-forget over UDP, with no delivery guarantees, no backpressure, a payload limit of about 1472 bytes per datagram before IP fragmentation (on a standard 1500-byte MTU), and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.

  3. Fixes the |m wire format issue: StatsDMeterImpl uses the non-standard |m StatsD type, which the OTel StatsD receiver silently drops. Native OTel counters eliminate the problem entirely and resolve the deferred Phase 6 Task 6.1.

  4. Richer metric semantics — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.

  5. Removes infrastructure dependency — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML.

  6. Metric-to-trace correlation — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics.

  7. Production-grade export — OTel's PeriodicMetricReader provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in StatsDCollectorImp.

What We Lose

  1. StatsD ecosystem compatibility: Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep server=statsd as a fallback.

  2. Simplicity of UDP — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts.

  3. Slightly higher memory — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.

  4. Dependency on OTel C++ Metrics SDK stability: The Metrics SDK has been GA since 1.0 and is currently at version 1.18.0, but it is less battle-tested than the tracing SDK in the C++ ecosystem.

Decision

The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. StatsDCollector is retained as a fallback via server=statsd for operators who need StatsD ecosystem compatibility during the transition period.


Architecture

Class Hierarchy (after Phase 7)

beast::insight::Collector (abstract interface — unchanged)
    |
    +-- StatsDCollector        (existing — retained as fallback, deprecated)
    |     +-- StatsDCounterImpl    -> StatsD |c over UDP
    |     +-- StatsDGaugeImpl      -> StatsD |g over UDP
    |     +-- StatsDMeterImpl      -> StatsD |m over UDP (non-standard)
    |     +-- StatsDEventImpl      -> StatsD |ms over UDP
    |     +-- StatsDHookImpl       -> 1s periodic callback
    |
    +-- NullCollector          (existing — unchanged, used when disabled)
    |     +-- NullCounterImpl      -> no-op
    |     +-- NullGaugeImpl        -> no-op
    |     +-- NullMeterImpl        -> no-op
    |     +-- NullEventImpl        -> no-op
    |     +-- NullHookImpl         -> no-op
    |
    +-- OTelCollector          (NEW — Phase 7)
          +-- OTelCounterImpl      -> otel::Counter<int64_t>
          +-- OTelGaugeImpl        -> otel::ObservableGauge<uint64_t>
          +-- OTelMeterImpl        -> otel::Counter<uint64_t>
          +-- OTelEventImpl        -> otel::Histogram<double>
          +-- OTelHookImpl         -> 1s periodic callback (same pattern)

Instrument Type Mapping

| beast::insight Type | OTel Metrics SDK Instrument | Rationale |
| --- | --- | --- |
| Counter (int64, delta, \|c) | `Counter<int64_t>` | Direct 1:1; both are monotonic delta counters |
| Gauge (uint64, current value, \|g) | `ObservableGauge<uint64_t>` with async callback | OTel gauges use async observation via callbacks; the existing Hook pattern already provides periodic polling |
| Meter (uint64, increment-only, \|m) | `Counter<uint64_t>` | Meters are semantically counters; this fixes the non-standard \|m wire format issue from Phase 6 Task 6.1 |
| Event (ms duration, \|ms) | `Histogram<double>` with explicit buckets | Duration distributions; uses the same buckets as SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms |
| Hook (1s periodic callback) | PeriodicMetricReader callback alignment | Collection interval matches the existing 1s period; hooks fire gauge observations before export |
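
The Event row above fixes ten explicit bucket boundaries. As a rough illustration of how a recorded duration lands in a bucket, the sketch below assumes upper-inclusive boundaries (the convention used by OTel explicit-bucket histograms) plus a final overflow bucket. `kEventBucketsMs` and `bucketIndexForMs` are hypothetical names, not part of beast::insight or the OTel SDK; in a real implementation the SDK performs this bucketing internally.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Bucket boundaries in milliseconds, copied from the Event row of the table.
constexpr std::array<double, 10> kEventBucketsMs{
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0};

// Hypothetical helper: returns the index of the bucket a duration falls into.
// Each boundary is an inclusive upper bound, so 10 boundaries give 11 buckets:
// index 0 holds values <= 1 ms and index 10 holds values > 5000 ms.
std::size_t bucketIndexForMs(double durationMs)
{
    // lower_bound requires sorted boundaries.
    assert(std::is_sorted(kEventBucketsMs.begin(), kEventBucketsMs.end()));

    // First boundary that is >= the duration; end() means overflow.
    auto const it = std::lower_bound(
        kEventBucketsMs.begin(), kEventBucketsMs.end(), durationMs);
    return static_cast<std::size_t>(it - kEventBucketsMs.begin());
}
```

With these boundaries a 7 ms event falls in bucket 2 (above 5 ms, up to 10 ms), and anything above 5000 ms falls in overflow bucket 10.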

Data Flow (after Phase 7)

graph LR
    subgraph rippledNode["rippled Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN"]
        B["beast::insight<br/>OTelCollector"]
    end

    subgraph collector["OTel Collector  :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC  |  :4318 HTTP"]
        BP["Batch Processor"]
        SM["SpanMetrics Connector"]

        R1 --> BP
        BP --> SM
    end

    subgraph backends["Trace Backends"]
        D["Jaeger / Tempo"]
    end

    subgraph metrics["Metrics Stack"]
        E["Prometheus  :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
    end

    subgraph viz["Visualization"]
        F["Grafana  :3000"]
    end

    A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
    B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1

    BP -->|"OTLP/gRPC"| D
    SM -->|"RED metrics"| E
    R1 -->|"rippled_* metrics<br/>(native OTLP)"| E

    E --> F
    D --> F

    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
    style B fill:#d9534f,color:#fff,stroke:#b52d2d
    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de

Key change: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.

Configuration

# [insight] section — new "otel" server option
[insight]
server=otel              # NEW: uses OTel OTLP metrics exporter
prefix=rippled           # metric name prefix (preserved)

# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces

The OTelCollector reads the OTLP endpoint from [telemetry] config (replacing /v1/traces with /v1/metrics for the metrics exporter). No additional config keys needed.
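
The endpoint derivation described above could look roughly like the following. This is a minimal sketch that assumes the swap applies only when the configured endpoint ends in /v1/traces. `metricsEndpointFromTraces` is a hypothetical helper name, and the plan does not specify how other endpoint shapes should be handled, so they are returned unchanged here.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: derive the OTLP metrics URL from the [telemetry]
// endpoint by swapping a trailing "/v1/traces" for "/v1/metrics".
// Endpoints that do not end in "/v1/traces" are returned as-is.
std::string metricsEndpointFromTraces(std::string_view endpoint)
{
    constexpr std::string_view tracesPath = "/v1/traces";
    constexpr std::string_view metricsPath = "/v1/metrics";

    std::string result(endpoint);
    bool const endsWithTraces = endpoint.size() >= tracesPath.size() &&
        endpoint.substr(endpoint.size() - tracesPath.size()) == tracesPath;
    if (endsWithTraces)
    {
        // Drop the traces suffix, then append the metrics suffix.
        result.erase(result.size() - tracesPath.size());
        result.append(metricsPath.data(), metricsPath.size());
    }
    return result;
}
```

Matching only the suffix avoids rewriting a host or path segment that happens to contain the same text earlier in the URL.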

Backward compatibility: server=statsd continues to work exactly as before.


Task 7.1: Add OTel Metrics SDK to Build Dependencies

Objective: Enable the OTel C++ Metrics SDK components in the build system.

What to do:

  • Edit conanfile.py:

    • Add OTel metrics SDK components to the dependency list when telemetry=True
    • Components needed: opentelemetry-cpp::metrics, opentelemetry-cpp::otlp_http_metric_exporter
  • Edit CMakeLists.txt (telemetry section):

    • Link opentelemetry::metrics and opentelemetry::otlp_http_metric_exporter targets

Key modified files:

  • conanfile.py
  • CMakeLists.txt (or the relevant telemetry cmake target)

Reference: 05-configuration-reference.md §5.3 — CMake integration


Task 7.2: Implement OTelCollector Class

Objective: Create the core OTelCollector implementation that maps beast::insight instruments to OTel Metrics SDK instruments.

What to do:

  • Create include/xrpl/beast/insight/OTelCollector.h:

    • Public factory: static std::shared_ptr<OTelCollector> New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)
    • Derives from StatsDCollector (or directly from Collector — TBD based on shared code)
  • Create src/libxrpl/beast/insight/OTelCollector.cpp (~400-500 lines):

    • OTelCounterImpl: Wraps opentelemetry::metrics::Counter<int64_t>. increment(amount) calls counter->Add(amount).
    • OTelGaugeImpl: Uses opentelemetry::metrics::ObservableGauge<uint64_t> with an async callback. set(value) stores value atomically; callback reads it during collection.
    • OTelMeterImpl: Wraps opentelemetry::metrics::Counter<uint64_t>. increment(amount) calls counter->Add(amount). Semantically identical to Counter but unsigned.
    • OTelEventImpl: Wraps opentelemetry::metrics::Histogram<double>. notify(duration) calls histogram->Record(duration.count()). Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms.
    • OTelHookImpl: Stores handler function. Called during periodic metric collection (same 1s pattern via PeriodicMetricReader).
    • OTelCollectorImp: Main class.
      • Creates MeterProvider with PeriodicMetricReader (1s export interval)
      • Creates OtlpHttpMetricExporter pointing to [telemetry] endpoint
      • Sets resource attributes (service.name, service.instance.id) matching trace exporter
      • Implements all make_*() factory methods
      • Prefixes metric names with [insight] prefix= value
  • Guard all OTel SDK includes with #ifdef XRPL_ENABLE_TELEMETRY so that the code compiles to NullCollector equivalents when telemetry is disabled.
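
The gauge and hook behaviour in the list above (set() stores atomically, hooks run before each collection, the callback reads the stored value) can be sketched without the OTel SDK. `GaugeSketch` and `CollectionCycleSketch` are hypothetical stand-ins for OTelGaugeImpl and the collection step. The real classes would register the read as an ObservableGauge callback instead of returning it directly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for OTelGaugeImpl: set() may be called from any
// thread; observe() is what an async gauge callback would read.
class GaugeSketch
{
public:
    void set(std::uint64_t value)
    {
        value_.store(value, std::memory_order_relaxed);
    }

    std::uint64_t observe() const
    {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};

// Hypothetical collection cycle: run every registered hook first so gauges
// are refreshed, then read the gauge value that would be exported.
class CollectionCycleSketch
{
public:
    void addHook(std::function<void()> hook)
    {
        hooks_.push_back(std::move(hook));
    }

    std::uint64_t collect(GaugeSketch const& gauge) const
    {
        for (auto const& hook : hooks_)
            hook();
        return gauge.observe();
    }

private:
    std::vector<std::function<void()>> hooks_;
};
```

Running hooks before the read is what keeps an exported gauge value at most one collection interval old.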

Key new files:

  • include/xrpl/beast/insight/OTelCollector.h
  • src/libxrpl/beast/insight/OTelCollector.cpp

Key patterns to follow:

  • Match StatsDCollector.cpp structure: private impl classes, intrusive list for metrics, strand-based thread safety
  • Match existing telemetry code style from src/libxrpl/telemetry/Telemetry.cpp
  • Use RAII for MeterProvider lifecycle (shutdown on destructor)
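
A minimal sketch of the RAII pattern named above, using a stub in place of the SDK provider so it compiles on its own. `ProviderStub` and `ProviderGuard` are hypothetical. The assumption is that the real MeterProvider exposes a shutdown-style call that flushes pending metrics and that the collector's destructor is the right place to invoke it.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the SDK meter provider. shutdown() records
// that it ran so the behaviour can be observed from outside.
struct ProviderStub
{
    bool* shutdownFlag;

    void shutdown()
    {
        *shutdownFlag = true;
    }
};

// Owns the provider and shuts it down exactly once, on destruction.
class ProviderGuard
{
public:
    explicit ProviderGuard(std::unique_ptr<ProviderStub> provider)
        : provider_(std::move(provider))
    {
    }

    ~ProviderGuard()
    {
        if (provider_)
            provider_->shutdown();
    }

    // Non-copyable: two owners would shut the provider down twice.
    ProviderGuard(ProviderGuard const&) = delete;
    ProviderGuard& operator=(ProviderGuard const&) = delete;

private:
    std::unique_ptr<ProviderStub> provider_;
};
```

Tying shutdown to the destructor means the final export still happens on early returns and exception paths, without an explicit stop call at each exit.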

Reference: 04-code-samples.md — code style and patterns


Task 7.3: Update CollectorManager

Objective: Add server=otel config option to route metric creation to the new OTel backend.

What to do:

  • Edit src/xrpld/app/main/CollectorManager.cpp:

    • In the constructor, add a third branch after server == "statsd":
      else if (server == "otel")
      {
          // Read the OTLP endpoint from the [telemetry] section.
          // telemetryParams is not yet available in this constructor
          // (see the note below).
          auto const endpoint = get(telemetryParams, "endpoint",
              "http://localhost:4318/v1/metrics");
          // Metric name prefix from [insight] prefix=
          std::string const& prefix(get(params, "prefix"));
          m_collector = beast::insight::OTelCollector::New(
              endpoint, prefix, journal);
      }
      
    • This requires access to the [telemetry] config section — may need to pass it as a parameter or read from Application config.
  • Edit src/xrpld/app/main/CollectorManager.h:

    • Add #include <xrpl/beast/insight/OTelCollector.h>
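
The three-way routing can be pictured as a small pure function. This is a sketch only: `InsightBackend` and `selectBackend` are hypothetical names, and falling back to the null collector for any other value is an assumption about the existing else branch.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical enumeration of the collector backends named in this plan.
enum class InsightBackend { Null, StatsD, OTel };

// Map the [insight] server= value to a backend. Unknown or empty values
// fall back to the null (no-op) collector.
InsightBackend selectBackend(std::string_view server)
{
    if (server == "statsd")
        return InsightBackend::StatsD;
    if (server == "otel")
        return InsightBackend::OTel;
    return InsightBackend::Null;
}
```

Keeping the statsd branch first and untouched is what preserves the backward compatibility promised for server=statsd.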

Key modified files:

  • src/xrpld/app/main/CollectorManager.cpp
  • src/xrpld/app/main/CollectorManager.h

Task 7.4: Update OTel Collector Configuration

Objective: Add the OTLP receiver to the metrics pipeline and remove the StatsD receiver dependency.

What to do:

  • Edit docker/telemetry/otel-collector-config.yaml:

    • Remove statsd receiver (no longer needed when server=otel)
    • Add metrics pipeline under service.pipelines:
      metrics:
        receivers: [otlp, spanmetrics]
        processors: [batch]
        exporters: [prometheus]
      
    • The OTLP receiver already listens on :4318 — it just needs to be added to the metrics pipeline receivers.
    • Keep spanmetrics connector in the metrics pipeline so span-derived RED metrics continue working.
  • Edit docker/telemetry/docker-compose.yml:

    • Remove UDP :8125 port mapping from otel-collector service
    • Update rippled service config: change [insight] server=statsd to server=otel

Key modified files:

  • docker/telemetry/otel-collector-config.yaml
  • docker/telemetry/docker-compose.yml

Note: Keep a commented-out statsd receiver block for operators who need backward compatibility.


Task 7.5: Preserve Metric Names in Prometheus

Objective: Ensure existing Grafana dashboards continue working with identical metric names.

What to do:

  • In OTelCollector.cpp, construct OTel instrument names to match existing Prometheus metric names:

    • beast::insight make_gauge("LedgerMaster", "Validated_Ledger_Age") → OTel instrument name: rippled_LedgerMaster_Validated_Ledger_Age
    • The prefix + group + name concatenation must produce the same string as StatsDCollector's format
    • Use underscores as separators (matching StatsD convention)
  • Verify in integration test that key Prometheus queries still return data:

    • rippled_LedgerMaster_Validated_Ledger_Age
    • rippled_Peer_Finder_Active_Inbound_Peers
    • rippled_rpc_requests

Key consideration: The OTel Collector's Prometheus exporter may normalize metric names differently than the StatsD receiver path did. Test this early (during Task 7.2) and adjust the naming strategy if needed. In particular, the Prometheus exporter typically appends a _total suffix to monotonic counters and converts dots to underscores, so instrument names must be chosen to match the existing conventions.
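
The naming rule in this task can be sketched as two small functions: one joins prefix, group, and name with underscores, the other predicts the name Prometheus is likely to show. Both `instrumentName` and `expectedPrometheusName` are hypothetical helpers. The _total handling encodes an assumption about exporter behaviour that should be confirmed against the running collector.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <string>

// Hypothetical helper: join the non-empty parts with '_' and replace any
// dots, so the result matches names such as
// rippled_LedgerMaster_Validated_Ledger_Age.
std::string instrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string result;
    for (std::string const* part : {&prefix, &group, &name})
    {
        if (part->empty())
            continue;
        if (!result.empty())
            result += '_';
        result += *part;
    }
    std::replace(result.begin(), result.end(), '.', '_');
    return result;
}

// Hypothetical helper: the name a dashboard query would need, assuming
// the exporter appends "_total" to monotonic counters that lack it.
std::string expectedPrometheusName(std::string name, bool isMonotonicCounter)
{
    std::string const suffix = "_total";
    bool const hasSuffix = name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    if (isMonotonicCounter && !hasSuffix)
        name += suffix;
    return name;
}
```

If the second function returns a different name for any counter currently used in a dashboard, that dashboard falls under Task 7.6.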


Task 7.6: Update Grafana Dashboards

Objective: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences.

What to do:

  • If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed.
  • If OTLP export produces different names (e.g., _total suffix on counters), update:
    • docker/telemetry/grafana/dashboards/statsd-node-health.json
    • docker/telemetry/grafana/dashboards/statsd-network-traffic.json
    • docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
  • Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced).

Key modified files:

  • docker/telemetry/grafana/dashboards/statsd-*.json (3 files, conditionally)

Task 7.7: Update Integration Tests

Objective: Verify the full OTLP metrics pipeline end-to-end.

What to do:

  • Edit docker/telemetry/integration-test.sh:
    • Update test config to use [insight] server=otel
    • Verify metrics arrive in Prometheus via OTLP (not StatsD)
    • Add check that StatsD receiver is no longer required
    • Preserve all existing metric presence checks
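
The presence checks amount to comparing a list of required metric names against what Prometheus exposes. The actual checks live in the shell script; the sketch below only illustrates the logic, assuming the input is Prometheus text exposition format. `missingMetrics` is a hypothetical name.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return the required metric names that do not
// appear as a sample line in the given Prometheus text exposition.
std::vector<std::string> missingMetrics(
    std::string const& exposition,
    std::vector<std::string> const& required)
{
    std::set<std::string> seen;
    std::istringstream in(exposition);
    std::string line;
    while (std::getline(in, line))
    {
        // Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if (line.empty() || line[0] == '#')
            continue;
        // The metric name ends at the label block or the first space.
        seen.insert(line.substr(0, line.find_first_of("{ ")));
    }

    std::vector<std::string> missing;
    for (auto const& name : required)
    {
        if (seen.count(name) == 0)
            missing.push_back(name);
    }
    return missing;
}
```

An empty result means every required metric was observed, which is the condition the integration test should assert.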

Key modified files:

  • docker/telemetry/integration-test.sh

Task 7.8: Update Documentation

Objective: Update all plan docs, runbook, and reference docs to reflect the migration.

What to do:

  • Edit docs/telemetry-runbook.md:

    • Update [insight] config examples to show server=otel
    • Update troubleshooting section (no more StatsD UDP debugging)
  • Edit OpenTelemetryPlan/09-data-collection-reference.md:

    • Update Data Flow Overview diagram (remove StatsD receiver)
    • Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)"
    • Update config examples
  • Edit OpenTelemetryPlan/05-configuration-reference.md:

    • Add server=otel option to [insight] section docs
  • Edit docker/telemetry/TESTING.md:

    • Update setup instructions to use server=otel

Key modified files:

  • docs/telemetry-runbook.md
  • OpenTelemetryPlan/09-data-collection-reference.md
  • OpenTelemetryPlan/05-configuration-reference.md
  • docker/telemetry/TESTING.md

Summary Table

| Task | Description | New Files | Modified Files | Effort | Risk | Depends On |
| --- | --- | --- | --- | --- | --- | --- |
| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | 0.5d | Low | |
| 7.2 | Implement OTelCollector class | 2 | 0 | 3d | Medium | 7.1 |
| 7.3 | Update CollectorManager config routing | 0 | 2 | 0.5d | Low | 7.2 |
| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 0.5d | Low | 7.3 |
| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 1d | Medium | 7.2 |
| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 1d | Low | 7.5 |
| 7.7 | Update integration tests | 0 | 1 | 0.5d | Low | 7.4 |
| 7.8 | Update documentation | 0 | 4 | 1d | Low | 7.6 |

Total Effort: 8 days

Parallel work: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6.
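
To see how much the parallelism shortens the schedule, the effort and Depends On columns of the summary table can be fed into a longest-path calculation. This is an illustrative sketch that assumes unlimited parallel staffing; `TaskSketch`, `phase7Tasks`, `earliestFinish`, and `criticalPathDays` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct TaskSketch
{
    double effortDays;
    std::vector<std::string> dependsOn;
};

using TaskMap = std::map<std::string, TaskSketch>;

// Effort and dependencies copied from the summary table.
TaskMap phase7Tasks()
{
    return {
        {"7.1", {0.5, {}}},
        {"7.2", {3.0, {"7.1"}}},
        {"7.3", {0.5, {"7.2"}}},
        {"7.4", {0.5, {"7.3"}}},
        {"7.5", {1.0, {"7.2"}}},
        {"7.6", {1.0, {"7.5"}}},
        {"7.7", {0.5, {"7.4"}}},
        {"7.8", {1.0, {"7.6"}}},
    };
}

// Earliest finish of a task: latest finish among its dependencies plus
// its own effort. Results are memoized so each task is computed once.
double earliestFinish(
    std::string const& id,
    TaskMap const& tasks,
    std::map<std::string, double>& memo)
{
    auto const cached = memo.find(id);
    if (cached != memo.end())
        return cached->second;

    assert(tasks.count(id) == 1);
    TaskSketch const& task = tasks.at(id);
    double start = 0.0;
    for (auto const& dep : task.dependsOn)
        start = std::max(start, earliestFinish(dep, tasks, memo));
    return memo[id] = start + task.effortDays;
}

// Length of the longest dependency chain, in days.
double criticalPathDays(TaskMap const& tasks)
{
    std::map<std::string, double> memo;
    double longest = 0.0;
    for (auto const& entry : tasks)
        longest = std::max(longest, earliestFinish(entry.first, tasks, memo));
    return longest;
}
```

Under these dependencies the longest chain is 7.1, 7.2, 7.5, 7.6, 7.8 at 6.5 days, compared with 8 days of total effort.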

Exit Criteria (from 06-implementation-phases.md §6.8):

  • All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
  • server=otel is the default in development docker-compose
  • server=statsd still works as a fallback
  • Existing Grafana dashboards display data correctly
  • Integration test passes with OTLP-only metrics pipeline
  • No performance regression vs StatsD baseline (< 1% CPU overhead)
  • Deferred Task 6.1 (|m wire format) no longer relevant — Meter mapped to OTel Counter