diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index 410ccef7c0..5eb41d0160 100644 --- a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -447,6 +447,8 @@ span->SetAttribute("peer.id", peerId); ### 2.6.4 Coexistence Strategy +> **Note**: Phase 7 replaces the StatsD bridge with native OTel Metrics SDK export. The diagram below shows the Phase 6 intermediate state. See [Phase7_taskList.md](./Phase7_taskList.md) for the migration design where Beast Insight emits via OTLP instead of StatsD. + ```mermaid flowchart TB subgraph rippled["rippled Process"] @@ -467,6 +469,8 @@ flowchart TB style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff ``` +**Phase 7 target state**: Beast Insight routes to `OTelCollector` (new `Collector` implementation) which exports via OTLP/HTTP to the same collector endpoint as traces. StatsD UDP path becomes a deprecated fallback (`[insight] server=statsd`). See [06-implementation-phases.md §6.8](./06-implementation-phases.md) and [Phase7_taskList.md](./Phase7_taskList.md) for details. 
+ ### 2.6.5 Correlation with PerfLog Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging: diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 8ab0446391..353ba4a2ad 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -344,7 +344,95 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi - [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`) - [ ] All 3 new Grafana dashboards load without errors - [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests) -- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately) +- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping) + +--- + +## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12) + +**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline. + +### Why Migrate + +The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations: UDP fire-and-forget with no delivery guarantees, non-standard `|m` wire format, 1472-byte MTU fragmentation, and a split-brain architecture where traces use OTLP but metrics use StatsD. Phase 7 resolves all of these by implementing a new `OTelCollectorImpl` behind the unchanged `beast::insight::Collector` interface — zero changes at call sites. 
+ +**What we gain**: Unified OTLP pipeline, delivery guarantees, metric-trace correlation via shared resource attributes, explicit histogram buckets, simpler collector config (no StatsD receiver), and the `|m` meter issue is resolved by mapping to OTel `Counter`. + +**What we lose**: StatsD ecosystem compatibility (mitigated: `server=statsd` retained as fallback), slightly higher memory (~1-2 MB for OTel aggregation state), and dependency on OTel C++ Metrics SDK stability (mitigated: SDK 1.18.0 is GA). + +See [Phase7_taskList.md](./Phase7_taskList.md) for full rationale, architecture diagrams, and detailed task breakdown. + +### Instrument Type Mapping + +| beast::insight | OTel Metrics SDK | Rationale | +| ---------------------- | -------------------------------- | ---------------------------------------------------------------- | +| Counter (int64, `\|c`) | `Counter` | Direct 1:1 mapping | +| Gauge (uint64, `\|g`) | `ObservableGauge` | Async callback matches existing Hook polling pattern | +| Meter (uint64, `\|m`) | `Counter` | Fixes non-standard wire format; meters are semantically counters | +| Event (ms, `\|ms`) | `Histogram` | Duration distributions with explicit bucket boundaries | +| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval | + +### Tasks + +| Task | Description | Effort | Risk | +| ---- | ------------------------------------------------------------------------- | ------ | ------ | +| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) | 0.5d | Low | +| 7.2 | Implement `OTelCollector` class (~400-500 lines) | 3d | Medium | +| 7.3 | Update `CollectorManager` — add `server=otel` | 0.5d | Low | +| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) | 0.5d | Low | +| 7.5 | Preserve metric names in Prometheus (naming strategy) | 1d | Medium | +| 7.6 | Update Grafana dashboards (if names change) | 1d | Low | +| 7.7 | Update integration tests | 0.5d | Low | +| 7.8 | Update documentation 
(runbook, reference docs) | 1d | Low | + +**Total Effort**: 8 days + +### Exit Criteria + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant + +--- + +## 6.8.1 Phase 8: Log-Trace Correlation and Loki Ingestion (Week 13) + +**Objective**: Inject trace context (trace_id, span_id) into rippled's Journal log output and add Grafana Loki as a centralized log backend with bidirectional trace-log correlation in Grafana. + +### Sub-Phases + +**Phase 8a** (code change): Modify `Logs::format()` to read the thread-local OTel span context and prepend `trace_id=<32hex> span_id=<16hex>` to every log line emitted within an active span. Zero changes to the ~2,242 JLOG call sites — injection is transparent. + +**Phase 8b** (infra only): Add Grafana Loki to the Docker observability stack and configure the OTel Collector's filelog receiver to parse rippled's log format, extract trace_id, and export to Loki. Configure Grafana's Tempo-to-Loki and Loki-to-Tempo derived field links for one-click correlation. + +See [Phase8_taskList.md](./Phase8_taskList.md) for full motivation, architecture diagrams, and detailed task breakdown. 
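As a rough illustration of the Phase 8a idea, the sketch below prepends trace context to a log message only when a span is active. It deliberately avoids the OpenTelemetry API: `TraceContext` and `withTraceContext` are hypothetical stand-ins, and in a real change the identifiers would come from the thread-local span context that `Logs::format()` reads.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the identifiers of the active span.
// Not an OpenTelemetry type; used here only to keep the sketch self-contained.
struct TraceContext
{
    std::string traceId;  // expected: 32 hex characters
    std::string spanId;   // expected: 16 hex characters
};

// Hypothetical helper: returns the message unchanged when no span is active,
// otherwise prefixes it with "trace_id=... span_id=... ".
inline std::string
withTraceContext(
    std::string const& message,
    std::optional<TraceContext> const& ctx)
{
    if (!ctx)
        return message;  // outside a span: no trace fields, no empty placeholders

    // Precondition check on identifier widths (assumed W3C-style hex lengths).
    assert(ctx->traceId.size() == 32 && ctx->spanId.size() == 16);

    return "trace_id=" + ctx->traceId + " span_id=" + ctx->spanId + " " + message;
}
```

Passing `std::nullopt` models a log call made outside any span, so the line is emitted without trace fields, which is the behavior the Phase 8 exit criteria ask for.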
+ +### Tasks + +| Task | Description | Sub-Phase | Effort | Risk | +| ---- | ------------------------------------------ | --------- | ------ | ------ | +| 8.1 | Inject trace_id into `Logs::format()` | 8a | 1d | Low | +| 8.2 | Add Loki to Docker Compose stack | 8b | 0.5d | Low | +| 8.3 | Add filelog receiver to OTel Collector | 8b | 1d | Medium | +| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0.5d | Low | +| 8.5 | Update integration tests | 8a + 8b | 0.5d | Low | +| 8.6 | Update documentation | 8a + 8b | 1d | Low | + +**Total Effort**: 4.5 days + +### Exit Criteria + +- [ ] Log lines within active spans contain `trace_id=<32hex> span_id=<16hex>` +- [ ] Log lines outside spans have no trace context (clean — no empty fields) +- [ ] Loki ingests rippled logs via OTel Collector filelog receiver +- [ ] Grafana Tempo → Loki one-click correlation works +- [ ] Grafana Loki → Tempo reverse lookup works via derived field +- [ ] Integration test verifies trace_id presence in logs +- [ ] No performance regression from trace_id injection (< 0.1% overhead) --- @@ -623,13 +711,16 @@ Clear, measurable criteria for each phase. 
### 6.13.6 Success Metrics Summary -| Phase | Primary Metric | Secondary Metric | Deadline | -| ------- | ---------------------- | --------------------------- | ------------- | -| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | -| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | -| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | -| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | -| Phase 5 | Production deployment | Operators trained | End of Week 9 | +| Phase | Primary Metric | Secondary Metric | Deadline | +| ------- | ---------------------------- | --------------------------- | -------------- | +| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | +| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | +| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | +| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | +| Phase 5 | Production deployment | Operators trained | End of Week 9 | +| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | +| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | +| Phase 8 | trace_id in all logs | Loki ingestion working | End of Week 13 | --- diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index 2298c22d08..5f4175c3b9 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -271,6 +271,8 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er ## 2. StatsD Metrics (beast::insight) > **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory. 
+> +> **Migration planned**: [Phase7_taskList.md](./Phase7_taskList.md) replaces the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changes. `[insight] server=statsd` remains as a fallback. These are system-level metrics emitted by rippled's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans. @@ -484,6 +486,12 @@ rippled_State_Accounting_Full_duration --- +## 5a. Future: Log-Trace Correlation (Phase 8) + +> **Planned**: [Phase8_taskList.md](./Phase8_taskList.md) adds `trace_id` and `span_id` to every JLOG log line emitted within an active OTel span. Combined with Grafana Loki ingestion, this enables one-click navigation between traces (Tempo) and logs (Loki). No changes to JLOG call sites — injection is transparent in `Logs::format()`. + +--- + ## 6. Known Issues | Issue | Impact | Status | diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md index 30159d97bf..833047b844 100644 --- a/OpenTelemetryPlan/OpenTelemetryPlan.md +++ b/OpenTelemetryPlan/OpenTelemetryPlan.md @@ -157,17 +157,20 @@ OpenTelemetry Collector configurations are provided for development (with Jaeger ## 6. 
Implementation Phases -The implementation spans 9 weeks across 5 phases: +The implementation spans 13 weeks across 8 phases: -| Phase | Duration | Focus | Key Deliverables | -| ----- | --------- | ------------------- | --------------------------------------------------- | -| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration | -| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation | -| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation | -| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing | -| 5 | Week 9 | Documentation | Runbook, Dashboards, Training | +| Phase | Duration | Focus | Key Deliverables | +| ----- | ----------- | --------------------- | ----------------------------------------------------------- | +| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration | +| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation | +| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation | +| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing | +| 5 | Week 9 | Documentation | Runbook, Dashboards, Training | +| 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards | +| 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation | +| 8 | Week 13 | Log-Trace Correlation | trace_id in JLOG output, Loki ingestion, Tempo-Loki linking | -**Total Effort**: 47 developer-days with 2 developers +**Total Effort**: 65.1 developer-days with 2 developers ➡️ **[View full Implementation Phases](./06-implementation-phases.md)** diff --git a/OpenTelemetryPlan/Phase7_taskList.md b/OpenTelemetryPlan/Phase7_taskList.md new file mode 100644 index 0000000000..ea8f05d08b --- /dev/null +++ b/OpenTelemetryPlan/Phase7_taskList.md @@ -0,0 +1,407 @@ +# Phase 7: Native OTel Metrics Migration 
— Task List + +> **Goal**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency. +> +> **Scope**: New `OTelCollectorImpl` class, `CollectorManager` config change, OTel Collector pipeline update, Grafana dashboard metric name migration, integration tests. +> +> **Branch**: `pratik/otel-phase7-native-metrics` (from `pratik/otel-phase6-statsd`) + +### Related Plan Documents + +| Document | Relevance | +| -------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| [02-design-decisions.md](./02-design-decisions.md) | Collector interface design, beast::insight coexistence strategy (replaced) | +| [05-configuration-reference.md](./05-configuration-reference.md) | `[insight]` and `[telemetry]` config sections | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 7 summary, exit criteria, success metrics (§6.8) | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved | + +--- + +## Motivation: Why Migrate from StatsD to Native OTel Metrics + +### What We Gain + +1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." + +2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. + +3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. 
Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 — DEFERRED becomes resolved). + +4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. + +5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. + +6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. + +7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. + +### What We Lose + +1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. + +2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts. + +3. **Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. + +4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. 
+ +### Decision + +The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. + +--- + +## Architecture + +### Class Hierarchy (after Phase 7) + +``` +beast::insight::Collector (abstract interface — unchanged) + | + +-- StatsDCollector (existing — retained as fallback, deprecated) + | +-- StatsDCounterImpl -> StatsD |c over UDP + | +-- StatsDGaugeImpl -> StatsD |g over UDP + | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) + | +-- StatsDEventImpl -> StatsD |ms over UDP + | +-- StatsDHookImpl -> 1s periodic callback + | + +-- NullCollector (existing — unchanged, used when disabled) + | +-- NullCounterImpl -> no-op + | +-- NullGaugeImpl -> no-op + | +-- NullMeterImpl -> no-op + | +-- NullEventImpl -> no-op + | +-- NullHookImpl -> no-op + | + +-- OTelCollector (NEW — Phase 7) + +-- OTelCounterImpl -> otel::Counter + +-- OTelGaugeImpl -> otel::ObservableGauge + +-- OTelMeterImpl -> otel::Counter + +-- OTelEventImpl -> otel::Histogram + +-- OTelHookImpl -> 1s periodic callback (same pattern) +``` + +### Instrument Type Mapping + +| beast::insight Type | OTel Metrics SDK Instrument | Rationale | +| --------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| `Counter` (int64, delta, `\|c`) | `Counter` | Direct 1:1 — both are monotonic delta counters | +| `Gauge` (uint64, current value, `\|g`) | `ObservableGauge` with async callback | OTel gauges use async observation via callbacks; the existing Hook pattern already provides periodic polling | +| `Meter` (uint64, increment-only, `\|m`) | `Counter` | Meters are semantically counters — this fixes the non-standard `\|m` wire format issue from Phase 6 
Task 6.1 | +| `Event` (ms duration, `\|ms`) | `Histogram` with explicit buckets | Duration distributions — use same buckets as SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms | +| `Hook` (1s periodic callback) | `PeriodicMetricReader` callback alignment | Collection interval matches existing 1s period; hooks fire gauge observations before export | + +### Data Flow (after Phase 7) + +```mermaid +graph LR + subgraph rippledNode["rippled Node"] + A["Trace Macros<br/>XRPL_TRACE_SPAN"] + B["beast::insight<br/>OTelCollector"] + end + + subgraph collector["OTel Collector :4317 / :4318"] + direction TB + R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"] + BP["Batch Processor"] + SM["SpanMetrics Connector"] + + R1 --> BP + BP --> SM + end + + subgraph backends["Trace Backends"] + D["Jaeger / Tempo"] + end + + subgraph metrics["Metrics Stack"] + E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"] + end + + subgraph viz["Visualization"] + F["Grafana :3000"] + end + + A -->|"OTLP/HTTP :4318<br/>(traces)"| R1 + B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1 + + BP -->|"OTLP/gRPC"| D + SM -->|"RED metrics"| E + R1 -->|"rippled_* metrics<br/>(native OTLP)"| E + + E --> F + D --> F + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style B fill:#d9534f,color:#fff,stroke:#b52d2d + style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style SM fill:#449d44,color:#fff,stroke:#2d6e2d + style D fill:#f0ad4e,color:#000,stroke:#c78c2e + style E fill:#f0ad4e,color:#000,stroke:#c78c2e + style F fill:#5bc0de,color:#000,stroke:#3aa8c1 + style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c + style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e + style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e + style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de +``` + +**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. + +### Configuration + +```ini +# [insight] section — new "otel" server option +[insight] +server=otel # NEW: uses OTel OTLP metrics exporter +prefix=rippled # metric name prefix (preserved) + +# Endpoint and auth inherited from [telemetry] section: +[telemetry] +enabled=1 +endpoint=http://localhost:4318/v1/traces +``` + +The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. + +**Backward compatibility**: `server=statsd` continues to work exactly as before. + +--- + +## Task 7.1: Add OTel Metrics SDK to Build Dependencies + +**Objective**: Enable the OTel C++ Metrics SDK components in the build system. 
+ +**What to do**: + +- Edit `conanfile.py`: + - Add OTel metrics SDK components to the dependency list when `telemetry=True` + - Components needed: `opentelemetry-cpp::metrics`, `opentelemetry-cpp::otlp_http_metric_exporter` + +- Edit `CMakeLists.txt` (telemetry section): + - Link `opentelemetry::metrics` and `opentelemetry::otlp_http_metric_exporter` targets + +**Key modified files**: + +- `conanfile.py` +- `CMakeLists.txt` (or the relevant telemetry cmake target) + +**Reference**: [05-configuration-reference.md §5.3](./05-configuration-reference.md) — CMake integration + +--- + +## Task 7.2: Implement OTelCollector Class + +**Objective**: Create the core `OTelCollector` implementation that maps beast::insight instruments to OTel Metrics SDK instruments. + +**What to do**: + +- Create `include/xrpl/beast/insight/OTelCollector.h`: + - Public factory: `static std::shared_ptr New(std::string const& endpoint, std::string const& prefix, beast::Journal journal)` + - Derives from `StatsDCollector` (or directly from `Collector` — TBD based on shared code) + +- Create `src/libxrpl/beast/insight/OTelCollector.cpp` (~400-500 lines): + - **OTelCounterImpl**: Wraps `opentelemetry::metrics::Counter`. `increment(amount)` calls `counter->Add(amount)`. + - **OTelGaugeImpl**: Uses `opentelemetry::metrics::ObservableGauge` with an async callback. `set(value)` stores value atomically; callback reads it during collection. + - **OTelMeterImpl**: Wraps `opentelemetry::metrics::Counter`. `increment(amount)` calls `counter->Add(amount)`. Semantically identical to Counter but unsigned. + - **OTelEventImpl**: Wraps `opentelemetry::metrics::Histogram`. `notify(duration)` calls `histogram->Record(duration.count())`. Uses explicit bucket boundaries matching SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms. + - **OTelHookImpl**: Stores handler function. Called during periodic metric collection (same 1s pattern via PeriodicMetricReader). + - **OTelCollectorImp**: Main class. 
+ - Creates `MeterProvider` with `PeriodicMetricReader` (1s export interval) + - Creates `OtlpHttpMetricExporter` pointing to `[telemetry]` endpoint + - Sets resource attributes (service.name, service.instance.id) matching trace exporter + - Implements all `make_*()` factory methods + - Prefixes metric names with `[insight] prefix=` value + +- Guard all OTel SDK includes with `#ifdef XRPL_ENABLE_TELEMETRY` to compile to `NullCollector` equivalents when telemetry disabled. + +**Key new files**: + +- `include/xrpl/beast/insight/OTelCollector.h` +- `src/libxrpl/beast/insight/OTelCollector.cpp` + +**Key patterns to follow**: + +- Match `StatsDCollector.cpp` structure: private impl classes, intrusive list for metrics, strand-based thread safety +- Match existing telemetry code style from `src/libxrpl/telemetry/Telemetry.cpp` +- Use RAII for MeterProvider lifecycle (shutdown on destructor) + +**Reference**: [04-code-samples.md](./04-code-samples.md) — code style and patterns + +--- + +## Task 7.3: Update CollectorManager + +**Objective**: Add `server=otel` config option to route metric creation to the new OTel backend. + +**What to do**: + +- Edit `src/xrpld/app/main/CollectorManager.cpp`: + - In the constructor, add a third branch after `server == "statsd"`: + ```cpp + else if (server == "otel") + { + // Read endpoint from [telemetry] section + auto const endpoint = get(telemetryParams, "endpoint", + "http://localhost:4318/v1/metrics"); + std::string const& prefix(get(params, "prefix")); + m_collector = beast::insight::OTelCollector::New( + endpoint, prefix, journal); + } + ``` + - This requires access to the `[telemetry]` config section — may need to pass it as a parameter or read from Application config. 
+ +- Edit `src/xrpld/app/main/CollectorManager.h`: + - Add `#include ` + +**Key modified files**: + +- `src/xrpld/app/main/CollectorManager.cpp` +- `src/xrpld/app/main/CollectorManager.h` + +--- + +## Task 7.4: Update OTel Collector Configuration + +**Objective**: Add a metrics pipeline to the OTLP receiver and remove the StatsD receiver dependency. + +**What to do**: + +- Edit `docker/telemetry/otel-collector-config.yaml`: + - Remove `statsd` receiver (no longer needed when `server=otel`) + - Add metrics pipeline under `service.pipelines`: + ```yaml + metrics: + receivers: [otlp, spanmetrics] + processors: [batch] + exporters: [prometheus] + ``` + - The OTLP receiver already listens on :4318 — it just needs to be added to the metrics pipeline receivers. + - Keep `spanmetrics` connector in the metrics pipeline so span-derived RED metrics continue working. + +- Edit `docker/telemetry/docker-compose.yml`: + - Remove UDP :8125 port mapping from otel-collector service + - Update rippled service config: change `[insight] server=statsd` to `server=otel` + +**Key modified files**: + +- `docker/telemetry/otel-collector-config.yaml` +- `docker/telemetry/docker-compose.yml` + +**Note**: Keep a commented-out `statsd` receiver block for operators who need backward compatibility. + +--- + +## Task 7.5: Preserve Metric Names in Prometheus + +**Objective**: Ensure existing Grafana dashboards continue working with identical metric names. 
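To make the naming goal concrete, the following minimal C++ sketch composes an instrument name from a prefix, a group, and a metric name joined by underscores, so that it reproduces series such as the `rippled_LedgerMaster_Validated_Ledger_Age` metric already queried in Phase 6. The helper `makeInstrumentName` is hypothetical, and the underscore-joining rule is an assumption about how `OTelCollector` might build names, not code taken from the repository.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: joins the non-empty parts with '_' so the resulting
// instrument name lines up with the existing Prometheus series names.
inline std::string
makeInstrumentName(
    std::string const& prefix,
    std::string const& group,
    std::string const& name)
{
    std::string result;
    for (std::string const* part : {&prefix, &group, &name})
    {
        if (part->empty())
            continue;  // skip missing prefix or group rather than emit "__"
        if (!result.empty())
            result += '_';
        result += *part;
    }
    return result;
}
```

For example, `makeInstrumentName("rippled", "LedgerMaster", "Validated_Ledger_Age")` yields `rippled_LedgerMaster_Validated_Ledger_Age`. Whether the exported Prometheus name stays identical still depends on the exporter's own normalization, so a check like this is only a first step.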
+ +**What to do**: + +- In `OTelCollector.cpp`, construct OTel instrument names to match existing Prometheus metric names: + - beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `rippled_LedgerMaster_Validated_Ledger_Age` + - The prefix + group + name concatenation must produce the same string as `StatsDCollector`'s format + - Use underscores as separators (matching StatsD convention) + +- Verify in integration test that key Prometheus queries still return data: + - `rippled_LedgerMaster_Validated_Ledger_Age` + - `rippled_Peer_Finder_Active_Inbound_Peers` + - `rippled_rpc_requests` + +**Key consideration**: OTel Prometheus exporter may normalize metric names differently than StatsD receiver. Test this early (Task 7.2) and adjust naming strategy if needed. The OTel SDK's Prometheus exporter adds `_total` suffix to counters and converts dots to underscores — match existing conventions. + +--- + +## Task 7.6: Update Grafana Dashboards + +**Objective**: Update the 3 StatsD dashboards if any metric names change due to OTLP export format differences. + +**What to do**: + +- If Task 7.5 confirms metric names are preserved exactly, no dashboard changes needed. +- If OTLP export produces different names (e.g., `_total` suffix on counters), update: + - `docker/telemetry/grafana/dashboards/statsd-node-health.json` + - `docker/telemetry/grafana/dashboards/statsd-network-traffic.json` + - `docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json` +- Rename dashboard titles from "StatsD" to "System Metrics" or similar (since they're no longer StatsD-sourced). + +**Key modified files**: + +- `docker/telemetry/grafana/dashboards/statsd-*.json` (3 files, conditionally) + +--- + +## Task 7.7: Update Integration Tests + +**Objective**: Verify the full OTLP metrics pipeline end-to-end. 
+ +**What to do**: + +- Edit `docker/telemetry/integration-test.sh`: + - Update test config to use `[insight] server=otel` + - Verify metrics arrive in Prometheus via OTLP (not StatsD) + - Add check that StatsD receiver is no longer required + - Preserve all existing metric presence checks + +**Key modified files**: + +- `docker/telemetry/integration-test.sh` + +--- + +## Task 7.8: Update Documentation + +**Objective**: Update all plan docs, runbook, and reference docs to reflect the migration. + +**What to do**: + +- Edit `docs/telemetry-runbook.md`: + - Update `[insight]` config examples to show `server=otel` + - Update troubleshooting section (no more StatsD UDP debugging) + +- Edit `OpenTelemetryPlan/09-data-collection-reference.md`: + - Update Data Flow Overview diagram (remove StatsD receiver) + - Update Section 2 header from "StatsD Metrics" to "System Metrics (OTel native)" + - Update config examples + +- Edit `OpenTelemetryPlan/05-configuration-reference.md`: + - Add `server=otel` option to `[insight]` section docs + +- Edit `docker/telemetry/TESTING.md`: + - Update setup instructions to use `server=otel` + +**Key modified files**: + +- `docs/telemetry-runbook.md` +- `OpenTelemetryPlan/09-data-collection-reference.md` +- `OpenTelemetryPlan/05-configuration-reference.md` +- `docker/telemetry/TESTING.md` + +--- + +## Summary Table + +| Task | Description | New Files | Modified Files | Effort | Risk | Depends On | +| ---- | -------------------------------------- | --------- | -------------- | ------ | ------ | ---------- | +| 7.1 | Add OTel Metrics SDK to build deps | 0 | 2 | 0.5d | Low | — | +| 7.2 | Implement OTelCollector class | 2 | 0 | 3d | Medium | 7.1 | +| 7.3 | Update CollectorManager config routing | 0 | 2 | 0.5d | Low | 7.2 | +| 7.4 | Update OTel Collector YAML and Docker | 0 | 2 | 0.5d | Low | 7.3 | +| 7.5 | Preserve metric names in Prometheus | 0 | 1 | 1d | Medium | 7.2 | +| 7.6 | Update Grafana dashboards (if needed) | 0 | 3 | 1d | Low | 7.5 | +| 
7.7 | Update integration tests | 0 | 1 | 0.5d | Low | 7.4 | +| 7.8 | Update documentation | 0 | 4 | 1d | Low | 7.6 | + +**Total Effort**: 8 days + +**Parallel work**: Tasks 7.4 and 7.5 can run in parallel after 7.2/7.3 complete. Task 7.6 depends on 7.5's findings. Tasks 7.7 and 7.8 can run in parallel after 7.6. + +**Exit Criteria** (from [06-implementation-phases.md §6.8](./06-implementation-phases.md)): + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant — Meter mapped to OTel Counter diff --git a/OpenTelemetryPlan/Phase8_taskList.md b/OpenTelemetryPlan/Phase8_taskList.md new file mode 100644 index 0000000000..72c54188c3 --- /dev/null +++ b/OpenTelemetryPlan/Phase8_taskList.md @@ -0,0 +1,315 @@ +# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List + +> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability. +> +> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API. 
+> +> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`) + +### Related Plan Documents + +| Document | Relevance | +| ---------------------------------------------------------------- | ------------------------------------------------------------- | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 8 summary and exit criteria (§6.8.1) | +| [07-observability-backends.md](./07-observability-backends.md) | Loki backend recommendation, Grafana data source provisioning | +| [Phase7_taskList.md](./Phase7_taskList.md) | Prerequisite — native OTel metrics pipeline must be working | +| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle) | + +--- + +## Motivation + +### The Problem + +rippled emits ~2,242 JLOG calls across 194 files. When investigating an issue (e.g., a consensus failure or a slow RPC), operators must manually correlate timestamps between Jaeger traces and log files — a tedious, error-prone process that loses context. + +### What We Gain + +1. **One-click trace-to-log correlation** — Grafana's Tempo → Loki link lets operators click a trace span and see all log lines emitted during that span's lifetime. No manual timestamp matching. + +2. **Log-to-trace reverse lookup** — An error in the logs like `"Transaction failed"` can be traced back to the exact distributed trace via the `trace_id` field. Grafana Loki's LogQL: `{job="rippled"} |= "trace_id=abc123"`. + +3. **Structured log context** — trace_id and span_id make logs queryable by trace, enabling aggregate analysis (e.g., "show all logs for consensus rounds that took >5s"). + +4. **Zero call-site changes** — The injection happens in `Logs::format()` or `Logs::write()`, transparently to the ~2,242 JLOG call sites. + +5. **Centralized log storage** — Loki provides indexed, compressed log retention with LogQL queries, replacing ad-hoc `grep` on individual node log files. 
+ +### What We Lose + +1. **Slightly larger log lines** — Each log line grows by ~70 characters (`trace_id=<32hex> span_id=<16hex>`). At ~1000 lines/min, this adds ~70 KB/min of log volume. + +2. **Thread-local dependency** — trace_id is only available when a span is active on the current thread. JLOG calls outside any span context carry no trace fields at all (this is fine: it marks non-traced code paths). + +3. **Loki infrastructure** — Adds another backend (Loki) to the observability stack. Mitigated: Loki is lightweight and can run as a single binary for development. + +--- + +## Architecture + +### Phase 8a: Trace ID Injection into Logs + +``` +Thread with active OTel span: + JLOG(journal_.info()) << "Processing transaction " << txHash; + +Flow: + 1. ScopedStream destructor calls sink.write(severity, text) + 2. Logs::write() calls Logs::format() + 3. format() checks thread-local OTel span context + 4. If span is active: prepends "trace_id=<32hex> span_id=<16hex> " to message + 5. Output: "2026-03-06T12:34:56.789Z Consensus:INF trace_id=abc123 span_id=def456 Processing transaction ..." +``` + +**Key design decision**: Inject in `Logs::format()` (the static formatting function) by accessing the thread-local OTel span context via `opentelemetry::trace::GetSpan(opentelemetry::context::RuntimeContext::GetCurrent())`. This is a read-only, lock-free operation. + +**Conditional**: Only inject when `[telemetry] enabled=1` and a span is active. When telemetry is disabled at build time (`-Dtelemetry=OFF`), the injection compiles to nothing. + +### Phase 8b: Filelog Receiver to Loki + +```mermaid +graph LR + subgraph rippledNode["rippled Node"] + A["JLOG output<br/>with trace_id"] + end + + A -->|"stdout / debug.log"| FL + + subgraph collector["OTel Collector"] + FL["Filelog Receiver<br/>parses rippled log format"] + BP["Batch Processor"] + + FL --> BP + end + + BP -->|"OTLP/gRPC"| L["Grafana Loki<br/>:3100"] + L --> G["Grafana :3000<br/>Explore: Loki + Tempo"] + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style FL fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style L fill:#9c27b0,color:#fff,stroke:#6a1b9a + style G fill:#5bc0de,color:#000,stroke:#3aa8c1 + style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c +``` + +**OTel Collector filelog receiver** parses rippled's log format, extracts `trace_id` and `span_id` as log attributes, and exports to Loki via OTLP. Grafana correlates Tempo traces with Loki logs via the shared `trace_id`. + +--- + +## Task 8.1: Inject trace_id into Logs::format() + +**Objective**: Add OTel trace context to every log line that is emitted within an active span. + +**What to do**: + +- Edit `src/libxrpl/basics/Log.cpp`: + - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span: + ```cpp + #ifdef XRPL_ENABLE_TELEMETRY + auto span = opentelemetry::trace::GetSpan( + opentelemetry::context::RuntimeContext::GetCurrent()); + auto ctx = span->GetContext(); + if (ctx.IsValid()) + { + // Append trace context as structured fields. + // ToLowerBase16 fills fixed-size spans of exactly 32 and 16 chars (no NUL terminator). + char traceId[32], spanId[16]; + ctx.trace_id().ToLowerBase16(traceId); + ctx.span_id().ToLowerBase16(spanId); + output += "trace_id="; + output.append(traceId, 32); + output += " span_id="; + output.append(spanId, 16); + output += ' '; + } + #endif + ``` + - Add `#include` for OTel context headers, guarded by `#ifdef XRPL_ENABLE_TELEMETRY` + +- Edit `include/xrpl/basics/Log.h`: + - No changes needed; the `format()` signature is unchanged + +**Key modified files**: + +- `src/libxrpl/basics/Log.cpp` + +**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking, measured at <10ns per call. With ~1000 JLOG calls/min, this adds <10 µs/min of overhead.
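The formatting step in Task 8.1 can be tried without linking the OTel SDK. The sketch below is only an illustration under stated assumptions: `FakeSpanContext`, `appendLowerHex`, and `appendTraceContext` are hypothetical names that do not exist in rippled or in opentelemetry-cpp, and the stand-in context simply holds raw ID bytes. It shows the intended output shape: a 67-character `trace_id=<32hex> span_id=<16hex> ` prefix when the context is valid, and nothing at all otherwise.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for opentelemetry::trace::SpanContext so the sketch
// builds without the OTel SDK. All-zero IDs are treated as invalid, which
// mirrors the W3C Trace Context rule.
struct FakeSpanContext
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};

    bool
    isValid() const
    {
        auto nonZero = [](auto const& bytes) {
            for (auto b : bytes)
                if (b != 0)
                    return true;
            return false;
        };
        return nonZero(traceId) && nonZero(spanId);
    }
};

// Append the lowercase hex form of `bytes` to `out` (two chars per byte).
template <std::size_t N>
void
appendLowerHex(std::string& out, std::array<std::uint8_t, N> const& bytes)
{
    static constexpr char digits[] = "0123456789abcdef";
    for (auto b : bytes)
    {
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
}

// Append "trace_id=<32hex> span_id=<16hex> " when the context is valid;
// leave `output` untouched otherwise, so untraced lines gain no empty fields.
void
appendTraceContext(std::string& output, FakeSpanContext const& ctx)
{
    if (!ctx.isValid())
        return;
    output += "trace_id=";
    appendLowerHex(output, ctx.traceId);
    output += " span_id=";
    appendLowerHex(output, ctx.spanId);
    output += ' ';
}
```

In the real change the hex encoding would come from `ToLowerBase16()` on the SDK's trace and span IDs, so this helper exists only to make the output format easy to unit-test in isolation.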
+ +--- + +## Task 8.2: Add Loki to Docker Compose Stack + +**Objective**: Add Grafana Loki as a log storage backend in the development observability stack. + +**What to do**: + +- Edit `docker/telemetry/docker-compose.yml`: + - Add Loki service: + ```yaml + loki: + image: grafana/loki:2.9.0 + ports: + - "3100:3100" + command: -config.file=/etc/loki/local-config.yaml + ``` + - Add Loki as a Grafana data source in provisioning + +- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`: + - Configure Loki data source with derived fields linking `trace_id` to Tempo + +**Key new files**: + +- `docker/telemetry/grafana/provisioning/datasources/loki.yaml` + +**Key modified files**: + +- `docker/telemetry/docker-compose.yml` + +--- + +## Task 8.3: Add Filelog Receiver to OTel Collector + +**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki. + +**What to do**: + +- Edit `docker/telemetry/otel-collector-config.yaml`: + - Add `filelog` receiver: + ```yaml + receivers: + filelog: + include: [/var/log/rippled/debug.log] + operators: + - type: regex_parser + regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$' + timestamp: + parse_from: attributes.timestamp + layout: "%Y-%m-%dT%H:%M:%S.%fZ" + ``` + - Add logs pipeline: + ```yaml + service: + pipelines: + logs: + receivers: [filelog] + processors: [batch] + exporters: [otlp/loki] + ``` + - Add Loki exporter: + ```yaml + exporters: + otlp/loki: + endpoint: loki:3100 + tls: + insecure: true + ``` + +- Mount rippled's log directory into the collector container via docker-compose volume + +**Key modified files**: + +- `docker/telemetry/otel-collector-config.yaml` +- `docker/telemetry/docker-compose.yml` + +--- + +## Task 8.4: Configure Grafana Trace-to-Log Correlation + +**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana.
+ +**What to do**: + +- Edit Grafana Tempo data source provisioning to add `tracesToLogs` configuration: + + ```yaml + tracesToLogs: + datasourceUid: loki + filterByTraceID: true + filterBySpanID: false + tags: ["partition", "severity"] + ``` + +- Edit Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo: + ```yaml + derivedFields: + - datasourceUid: tempo + matcherRegex: "trace_id=(\\w+)" + name: TraceID + url: "$${__value.raw}" + ``` + +**Key modified files**: + +- `docker/telemetry/grafana/provisioning/datasources/loki.yaml` +- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file) + +--- + +## Task 8.5: Update Integration Tests + +**Objective**: Verify trace_id appears in logs and Loki correlation works. + +**What to do**: + +- Edit `docker/telemetry/integration-test.sh`: + - After sending RPC requests (which create spans), grep rippled's log output for `trace_id=` + - Verify trace_id matches a trace visible in Jaeger + - Optionally: query Loki via API to confirm log ingestion + +**Key modified files**: + +- `docker/telemetry/integration-test.sh` + +--- + +## Task 8.6: Update Documentation + +**Objective**: Document the log correlation feature in runbook and reference docs. + +**What to do**: + +- Edit `docs/telemetry-runbook.md`: + - Add "Log-Trace Correlation" section explaining how to use Grafana Tempo → Loki linking + - Add LogQL query examples for filtering by trace_id + +- Edit `OpenTelemetryPlan/09-data-collection-reference.md`: + - Add new section "3. Log Correlation" between SpanMetrics and StatsD sections + - Document the log format with trace_id injection + - Document Loki as a new backend + +- Edit `docker/telemetry/TESTING.md`: + - Add log correlation verification steps + +**Key modified files**: + +- `docs/telemetry-runbook.md` +- `OpenTelemetryPlan/09-data-collection-reference.md` +- `docker/telemetry/TESTING.md` + +--- + +## Summary Table + +| Task | Description | Sub-Phase | New Files | Modified Files | Effort | Risk | Depends On | +| ---- | ------------------------------------------ | --------- | --------- | -------------- | ------ | ------ | ---------- | +| 8.1 | Inject trace_id into Logs::format() | 8a | 0 | 1 | 1d | Low | Phase 7 | +| 8.2 | Add Loki to Docker Compose stack | 8b | 1 | 1 | 0.5d | Low | — | +| 8.3 | Add filelog receiver to OTel Collector | 8b | 0 | 2 | 1d | Medium | 8.1, 8.2 | +| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0 | 2 | 0.5d | Low | 8.3 | +| 8.5 | Update integration tests | 8a + 8b | 0 | 1 | 0.5d | Low | 8.4 | +| 8.6 | Update documentation | 8a + 8b | 0 | 3 | 1d | Low | 8.5 | + +**Total Effort**: 4.5 days + +**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential. + +**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)): + +- [ ] Log lines within active spans contain `trace_id=<32hex> span_id=<16hex>` +- [ ] Log lines outside spans have no trace context (no empty fields) +- [ ] Loki ingests rippled logs via OTel Collector filelog receiver +- [ ] Grafana Tempo → Loki one-click correlation works +- [ ] Grafana Loki → Tempo reverse lookup works via derived field +- [ ] Integration test verifies trace_id presence in logs +- [ ] No performance regression from trace_id injection (< 0.1% overhead)
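The first two exit criteria depend on the log line shape that the Task 8.3 `regex_parser` expects, so it may help to exercise that shape offline before wiring up the collector. The following is a minimal sketch, assuming the `timestamp partition:severity [trace_id=… span_id=…] message` layout shown in the Architecture section. `ParsedLogLine` and `parseLogLine` are hypothetical names, and because `std::regex` has no named capture groups the pattern uses numbered groups in the same order.

```cpp
#include <optional>
#include <regex>
#include <string>

// Fields extracted from one rippled log line (hypothetical helper type).
struct ParsedLogLine
{
    std::string timestamp;
    std::string partition;
    std::string severity;
    std::string traceId;  // empty when the line was logged outside a span
    std::string spanId;   // empty when the line was logged outside a span
    std::string message;
};

// Parse a line using the same structure as the filelog regex: the trace
// fields sit in an optional non-capturing group, so both traced and
// untraced lines match.
std::optional<ParsedLogLine>
parseLogLine(std::string const& line)
{
    static std::regex const pattern(
        R"(^(\S+)\s+(\S+):(\S+)\s+(?:trace_id=([a-f0-9]+)\s+span_id=([a-f0-9]+)\s+)?(.*)$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return std::nullopt;
    // Unmatched optional groups convert to empty strings.
    return ParsedLogLine{m[1], m[2], m[3], m[4], m[5], m[6]};
}
```

A check along these lines could back the integration test in Task 8.5: a traced line should yield non-empty `traceId` and `spanId`, while an untraced line should still parse and leave both empty.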