From 702cf63c62629904d35c9cd931f3b6b293fb2c36 Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Fri, 6 Mar 2026 15:55:47 +0000 Subject: [PATCH] Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Move Phase 7 motivation (gains/losses/decision) and architecture (class hierarchy, data flow diagram, config) from Phase7_taskList.md into 06-implementation-phases.md §6.8 - Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table) - Remove Phase8_taskList.md — belongs on Phase 8 branch - Remove §6.8.1 (Phase 8) from 06-implementation-phases.md - Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md - Remove Phase 8 row from OpenTelemetryPlan.md phase table Co-Authored-By: Claude Opus 4.6 --- OpenTelemetryPlan/06-implementation-phases.md | 179 +++++++--- .../09-data-collection-reference.md | 6 - OpenTelemetryPlan/OpenTelemetryPlan.md | 5 +- OpenTelemetryPlan/Phase7_taskList.md | 163 +-------- OpenTelemetryPlan/Phase8_taskList.md | 315 ------------------ 5 files changed, 144 insertions(+), 524 deletions(-) delete mode 100644 OpenTelemetryPlan/Phase8_taskList.md diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 353ba4a2ad..056a8b80d6 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -352,15 +352,146 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi **Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline. 
-### Why Migrate +### Motivation: Why Migrate from StatsD to Native OTel Metrics -The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations: UDP fire-and-forget with no delivery guarantees, non-standard `|m` wire format, 1472-byte MTU fragmentation, and a split-brain architecture where traces use OTLP but metrics use StatsD. Phase 7 resolves all of these by implementing a new `OTelCollectorImpl` behind the unchanged `beast::insight::Collector` interface — zero changes at call sites. +The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves. -**What we gain**: Unified OTLP pipeline, delivery guarantees, metric-trace correlation via shared resource attributes, explicit histogram buckets, simpler collector config (no StatsD receiver), and the `|m` meter issue is resolved by mapping to OTel `Counter`. +#### What We Gain -**What we lose**: StatsD ecosystem compatibility (mitigated: `server=statsd` retained as fallback), slightly higher memory (~1-2 MB for OTel aggregation state), and dependency on OTel C++ Metrics SDK stability (mitigated: SDK 1.18.0 is GA). +1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." -See [Phase7_taskList.md](./Phase7_taskList.md) for full rationale, architecture diagrams, and detailed task breakdown. +2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. + +3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. 
Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 — DEFERRED becomes resolved). + +4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. + +5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. + +6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. + +7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. + +#### What We Lose + +1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. + +2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts. + +3. **Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. + +4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. 
+ +#### Decision + +The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. + +### Architecture + +#### Class Hierarchy (after Phase 7) + +``` +beast::insight::Collector (abstract interface — unchanged) + | + +-- StatsDCollector (existing — retained as fallback, deprecated) + | +-- StatsDCounterImpl -> StatsD |c over UDP + | +-- StatsDGaugeImpl -> StatsD |g over UDP + | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) + | +-- StatsDEventImpl -> StatsD |ms over UDP + | +-- StatsDHookImpl -> 1s periodic callback + | + +-- NullCollector (existing — unchanged, used when disabled) + | +-- NullCounterImpl -> no-op + | +-- NullGaugeImpl -> no-op + | +-- NullMeterImpl -> no-op + | +-- NullEventImpl -> no-op + | +-- NullHookImpl -> no-op + | + +-- OTelCollector (NEW — Phase 7) + +-- OTelCounterImpl -> otel::Counter + +-- OTelGaugeImpl -> otel::ObservableGauge + +-- OTelMeterImpl -> otel::Counter + +-- OTelEventImpl -> otel::Histogram + +-- OTelHookImpl -> 1s periodic callback (same pattern) +``` + +#### Data Flow (after Phase 7) + +```mermaid +graph LR + subgraph rippledNode["rippled Node"] + A["Trace Macros
<br/>XRPL_TRACE_SPAN"] + B["beast::insight<br/>OTelCollector"] + end + + subgraph collector["OTel Collector :4317 / :4318"] + direction TB + R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"] + BP["Batch Processor"] + SM["SpanMetrics Connector"] + + R1 --> BP + BP --> SM + end + + subgraph backends["Trace Backends"] + D["Jaeger / Tempo"] + end + + subgraph metrics["Metrics Stack"] + E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"] + end + + subgraph viz["Visualization"] + F["Grafana :3000"] + end + + A -->|"OTLP/HTTP :4318<br/>(traces)"| R1 + B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1 + + BP -->|"OTLP/gRPC"| D + SM -->|"RED metrics"| E + R1 -->|"rippled_* metrics<br/>
(native OTLP)"| E + + E --> F + D --> F + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style B fill:#d9534f,color:#fff,stroke:#b52d2d + style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style SM fill:#449d44,color:#fff,stroke:#2d6e2d + style D fill:#f0ad4e,color:#000,stroke:#c78c2e + style E fill:#f0ad4e,color:#000,stroke:#c78c2e + style F fill:#5bc0de,color:#000,stroke:#3aa8c1 + style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c + style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e + style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e + style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de +``` + +**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. + +#### Configuration + +```ini +# [insight] section — new "otel" server option +[insight] +server=otel # NEW: uses OTel OTLP metrics exporter +prefix=rippled # metric name prefix (preserved) + +# Endpoint and auth inherited from [telemetry] section: +[telemetry] +enabled=1 +endpoint=http://localhost:4318/v1/traces +``` + +The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. + +**Backward compatibility**: `server=statsd` continues to work exactly as before. + +See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. ### Instrument Type Mapping @@ -399,43 +530,6 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for full rationale, architecture --- -## 6.8.1 Phase 8: Log-Trace Correlation and Loki Ingestion (Week 13) - -**Objective**: Inject trace context (trace_id, span_id) into rippled's Journal log output and add Grafana Loki as a centralized log backend with bidirectional trace-log correlation in Grafana. 
- -### Sub-Phases - -**Phase 8a** (code change): Modify `Logs::format()` to read the thread-local OTel span context and prepend `trace_id= span_id=` to every log line emitted within an active span. Zero changes to the ~2,242 JLOG call sites — injection is transparent. - -**Phase 8b** (infra only): Add Grafana Loki to the Docker observability stack and configure the OTel Collector's filelog receiver to parse rippled's log format, extract trace_id, and export to Loki. Configure Grafana's Tempo-to-Loki and Loki-to-Tempo derived field links for one-click correlation. - -See [Phase8_taskList.md](./Phase8_taskList.md) for full motivation, architecture diagrams, and detailed task breakdown. - -### Tasks - -| Task | Description | Sub-Phase | Effort | Risk | -| ---- | ------------------------------------------ | --------- | ------ | ------ | -| 8.1 | Inject trace_id into `Logs::format()` | 8a | 1d | Low | -| 8.2 | Add Loki to Docker Compose stack | 8b | 0.5d | Low | -| 8.3 | Add filelog receiver to OTel Collector | 8b | 1d | Medium | -| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0.5d | Low | -| 8.5 | Update integration tests | 8a + 8b | 0.5d | Low | -| 8.6 | Update documentation | 8a + 8b | 1d | Low | - -**Total Effort**: 4.5 days - -### Exit Criteria - -- [ ] Log lines within active spans contain `trace_id= span_id=` -- [ ] Log lines outside spans have no trace context (clean — no empty fields) -- [ ] Loki ingests rippled logs via OTel Collector filelog receiver -- [ ] Grafana Tempo → Loki one-click correlation works -- [ ] Grafana Loki → Tempo reverse lookup works via derived field -- [ ] Integration test verifies trace_id presence in logs -- [ ] No performance regression from trace_id injection (< 0.1% overhead) - ---- - ## 6.9 Risk Assessment ```mermaid @@ -720,7 +814,6 @@ Clear, measurable criteria for each phase. 
| Phase 5 | Production deployment | Operators trained | End of Week 9 | | Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | | Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | -| Phase 8 | trace_id in all logs | Loki ingestion working | End of Week 13 | --- diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index 5f4175c3b9..ed7e656ae6 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -486,12 +486,6 @@ rippled_State_Accounting_Full_duration --- -## 5a. Future: Log-Trace Correlation (Phase 8) - -> **Planned**: [Phase8_taskList.md](./Phase8_taskList.md) adds `trace_id` and `span_id` to every JLOG log line emitted within an active OTel span. Combined with Grafana Loki ingestion, this enables one-click navigation between traces (Tempo) and logs (Loki). No changes to JLOG call sites — injection is transparent in `Logs::format()`. - ---- - ## 6. Known Issues | Issue | Impact | Status | diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md index 833047b844..94d5cfec5b 100644 --- a/OpenTelemetryPlan/OpenTelemetryPlan.md +++ b/OpenTelemetryPlan/OpenTelemetryPlan.md @@ -157,7 +157,7 @@ OpenTelemetry Collector configurations are provided for development (with Jaeger ## 6. 
Implementation Phases -The implementation spans 13 weeks across 8 phases: +The implementation spans 12 weeks across 7 phases: | Phase | Duration | Focus | Key Deliverables | | ----- | ----------- | --------------------- | ----------------------------------------------------------- | @@ -168,9 +168,8 @@ The implementation spans 13 weeks across 8 phases: | 5 | Week 9 | Documentation | Runbook, Dashboards, Training | | 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards | | 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation | -| 8 | Week 13 | Log-Trace Correlation | trace_id in JLOG output, Loki ingestion, Tempo-Loki linking | -**Total Effort**: 65.1 developer-days with 2 developers +**Total Effort**: 60.6 developer-days with 2 developers ➡️ **[View full Implementation Phases](./06-implementation-phases.md)** diff --git a/OpenTelemetryPlan/Phase7_taskList.md b/OpenTelemetryPlan/Phase7_taskList.md index ea8f05d08b..4f4e1d9e97 100644 --- a/OpenTelemetryPlan/Phase7_taskList.md +++ b/OpenTelemetryPlan/Phase7_taskList.md @@ -8,163 +8,12 @@ ### Related Plan Documents -| Document | Relevance | -| -------------------------------------------------------------------- | -------------------------------------------------------------------------- | -| [02-design-decisions.md](./02-design-decisions.md) | Collector interface design, beast::insight coexistence strategy (replaced) | -| [05-configuration-reference.md](./05-configuration-reference.md) | `[insight]` and `[telemetry]` config sections | -| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 7 summary, exit criteria, success metrics (§6.8) | -| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved | - ---- - -## Motivation: Why Migrate from StatsD to Native OTel Metrics - -### What We Gain - -1. 
**Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." - -2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. - -3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 — DEFERRED becomes resolved). - -4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. - -5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. - -6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. - -7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. - -### What We Lose - -1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraph) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. - -2. 
**Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts. - -3. **Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. - -4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. - -### Decision - -The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. - ---- - -## Architecture - -### Class Hierarchy (after Phase 7) - -``` -beast::insight::Collector (abstract interface — unchanged) - | - +-- StatsDCollector (existing — retained as fallback, deprecated) - | +-- StatsDCounterImpl -> StatsD |c over UDP - | +-- StatsDGaugeImpl -> StatsD |g over UDP - | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) - | +-- StatsDEventImpl -> StatsD |ms over UDP - | +-- StatsDHookImpl -> 1s periodic callback - | - +-- NullCollector (existing — unchanged, used when disabled) - | +-- NullCounterImpl -> no-op - | +-- NullGaugeImpl -> no-op - | +-- NullMeterImpl -> no-op - | +-- NullEventImpl -> no-op - | +-- NullHookImpl -> no-op - | - +-- OTelCollector (NEW — Phase 7) - +-- OTelCounterImpl -> otel::Counter - +-- OTelGaugeImpl -> otel::ObservableGauge - +-- OTelMeterImpl -> otel::Counter - +-- OTelEventImpl -> otel::Histogram - +-- OTelHookImpl -> 1s periodic callback (same pattern) -``` - -### Instrument Type Mapping - -| beast::insight Type | OTel Metrics SDK Instrument | Rationale | -| 
--------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | -| `Counter` (int64, delta, `\|c`) | `Counter` | Direct 1:1 — both are monotonic delta counters | -| `Gauge` (uint64, current value, `\|g`) | `ObservableGauge` with async callback | OTel gauges use async observation via callbacks; the existing Hook pattern already provides periodic polling | -| `Meter` (uint64, increment-only, `\|m`) | `Counter` | Meters are semantically counters — this fixes the non-standard `\|m` wire format issue from Phase 6 Task 6.1 | -| `Event` (ms duration, `\|ms`) | `Histogram` with explicit buckets | Duration distributions — use same buckets as SpanMetrics: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000] ms | -| `Hook` (1s periodic callback) | `PeriodicMetricReader` callback alignment | Collection interval matches existing 1s period; hooks fire gauge observations before export | - -### Data Flow (after Phase 7) - -```mermaid -graph LR - subgraph rippledNode["rippled Node"] - A["Trace Macros
<br/>XRPL_TRACE_SPAN"] - B["beast::insight<br/>OTelCollector"] - end - - subgraph collector["OTel Collector :4317 / :4318"] - direction TB - R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"] - BP["Batch Processor"] - SM["SpanMetrics Connector"] - - R1 --> BP - BP --> SM - end - - subgraph backends["Trace Backends"] - D["Jaeger / Tempo"] - end - - subgraph metrics["Metrics Stack"] - E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"] - end - - subgraph viz["Visualization"] - F["Grafana :3000"] - end - - A -->|"OTLP/HTTP :4318<br/>(traces)"| R1 - B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1 - - BP -->|"OTLP/gRPC"| D - SM -->|"RED metrics"| E - R1 -->|"rippled_* metrics<br/>
(native OTLP)"| E - - E --> F - D --> F - - style A fill:#4a90d9,color:#fff,stroke:#2a6db5 - style B fill:#d9534f,color:#fff,stroke:#b52d2d - style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d - style BP fill:#449d44,color:#fff,stroke:#2d6e2d - style SM fill:#449d44,color:#fff,stroke:#2d6e2d - style D fill:#f0ad4e,color:#000,stroke:#c78c2e - style E fill:#f0ad4e,color:#000,stroke:#c78c2e - style F fill:#5bc0de,color:#000,stroke:#3aa8c1 - style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 - style collector fill:#1a3320,color:#ccc,stroke:#5cb85c - style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e - style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e - style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de -``` - -**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. - -### Configuration - -```ini -# [insight] section — new "otel" server option -[insight] -server=otel # NEW: uses OTel OTLP metrics exporter -prefix=rippled # metric name prefix (preserved) - -# Endpoint and auth inherited from [telemetry] section: -[telemetry] -enabled=1 -endpoint=http://localhost:4318/v1/traces -``` - -The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. - -**Backward compatibility**: `server=statsd` continues to work exactly as before. 
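The endpoint derivation described in the Configuration note (swapping `/v1/traces` for `/v1/metrics`) could be sketched roughly as follows. This is a minimal illustration, not code from the Phase 7 branch: `deriveMetricsEndpoint` is a hypothetical helper name, and passing through endpoints that do not end in `/v1/traces` unchanged is an assumption, since the plan does not say how such values are handled.

```cpp
#include <string>

// Hypothetical helper (not part of rippled or the OTel C++ SDK): derives the
// OTLP/HTTP metrics URL from the traces URL configured under [telemetry].
inline std::string
deriveMetricsEndpoint(std::string const& tracesEndpoint)
{
    static std::string const tracesPath = "/v1/traces";
    static std::string const metricsPath = "/v1/metrics";

    // Swap the path only when the endpoint ends with "/v1/traces".
    if (tracesEndpoint.size() >= tracesPath.size() &&
        tracesEndpoint.compare(
            tracesEndpoint.size() - tracesPath.size(),
            tracesPath.size(),
            tracesPath) == 0)
    {
        return tracesEndpoint.substr(
                   0, tracesEndpoint.size() - tracesPath.size()) +
            metricsPath;
    }

    // Assumption: any other endpoint is returned unchanged so the caller
    // can decide how to treat it.
    return tracesEndpoint;
}
```

With the sample `[telemetry]` config, `deriveMetricsEndpoint("http://localhost:4318/v1/traces")` yields `http://localhost:4318/v1/metrics`.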
+| Document | Relevance | +| -------------------------------------------------------------------- | --------------------------------------------------------------- | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 7 plan: motivation, architecture, exit criteria (§6.8) | +| [02-design-decisions.md](./02-design-decisions.md) | Collector interface design, beast::insight coexistence strategy | +| [05-configuration-reference.md](./05-configuration-reference.md) | `[insight]` and `[telemetry]` config sections | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Complete metric inventory that must be preserved | --- diff --git a/OpenTelemetryPlan/Phase8_taskList.md b/OpenTelemetryPlan/Phase8_taskList.md deleted file mode 100644 index 72c54188c3..0000000000 --- a/OpenTelemetryPlan/Phase8_taskList.md +++ /dev/null @@ -1,315 +0,0 @@ -# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List - -> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability. -> -> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API. 
-> -> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`) - -### Related Plan Documents - -| Document | Relevance | -| ---------------------------------------------------------------- | ------------------------------------------------------------- | -| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 8 summary and exit criteria (§6.8.1) | -| [07-observability-backends.md](./07-observability-backends.md) | Loki backend recommendation, Grafana data source provisioning | -| [Phase7_taskList.md](./Phase7_taskList.md) | Prerequisite — native OTel metrics pipeline must be working | -| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle) | - ---- - -## Motivation - -### The Problem - -rippled emits ~2,242 JLOG calls across 194 files. When investigating an issue (e.g., a consensus failure or a slow RPC), operators must manually correlate timestamps between Jaeger traces and log files — a tedious, error-prone process that loses context. - -### What We Gain - -1. **One-click trace-to-log correlation** — Grafana's Tempo → Loki link lets operators click a trace span and see all log lines emitted during that span's lifetime. No manual timestamp matching. - -2. **Log-to-trace reverse lookup** — An error in the logs like `"Transaction failed"` can be traced back to the exact distributed trace via the `trace_id` field. Grafana Loki's LogQL: `{job="rippled"} |= "trace_id=abc123"`. - -3. **Structured log context** — trace_id and span_id make logs queryable by trace, enabling aggregate analysis (e.g., "show all logs for consensus rounds that took >5s"). - -4. **Zero call-site changes** — The injection happens in `Logs::format()` or `Logs::write()`, transparently to the ~2,242 JLOG call sites. - -5. **Centralized log storage** — Loki provides indexed, compressed log retention with LogQL queries, replacing ad-hoc `grep` on individual node log files. 
- -### What We Lose - -1. **Slightly larger log lines** — Each log line grows by ~70 characters (`trace_id=<32hex> span_id=<16hex>`). At ~1000 lines/min, this adds ~70 KB/min of log volume. - -2. **Thread-local dependency** — trace_id is only available when a span is active on the current thread. JLOG calls outside any span context will have empty trace_id (this is fine — it indicates non-traced code paths). - -3. **Loki infrastructure** — Adds another backend (Loki) to the observability stack. Mitigated: Loki is lightweight and can run as a single binary for development. - ---- - -## Architecture - -### Phase 8a: Trace ID Injection into Logs - -``` -Thread with active OTel span: - JLOG(journal_.info()) << "Processing transaction " << txHash; - -Flow: - 1. ScopedStream destructor calls sink.write(severity, text) - 2. Logs::write() calls Logs::format() - 3. format() checks thread-local OTel span context - 4. If span is active: prepends "trace_id= span_id=" to message - 5. Output: "2026-03-06T12:34:56.789Z Consensus:INF trace_id=abc123 span_id=def456 Processing transaction ..." -``` - -**Key design decision**: Inject in `Logs::format()` (the static formatting function) by accessing the thread-local OTel span context via `opentelemetry::trace::GetSpan(opentelemetry::context::RuntimeContext::GetCurrent())`. This is a read-only, lock-free operation. - -**Conditional**: Only inject when `[telemetry] enabled=1` and a span is active. When telemetry is disabled at build time (`-Dtelemetry=OFF`), the injection compiles to nothing. - -### Phase 8b: Filelog Receiver to Loki - -```mermaid -graph LR - subgraph rippledNode["rippled Node"] - A["JLOG output
<br/>with trace_id"] - end - - A -->|"stdout / debug.log"| FL - - subgraph collector["OTel Collector"] - FL["Filelog Receiver<br/>parses rippled log format"] - BP["Batch Processor"] - - FL --> BP - end - - BP -->|"OTLP/gRPC"| L["Grafana Loki<br/>:3100"] - L --> G["Grafana :3000<br/>
Explore: Loki + Tempo"] - - style A fill:#4a90d9,color:#fff,stroke:#2a6db5 - style FL fill:#5cb85c,color:#fff,stroke:#3d8b3d - style BP fill:#449d44,color:#fff,stroke:#2d6e2d - style L fill:#9c27b0,color:#fff,stroke:#6a1b9a - style G fill:#5bc0de,color:#000,stroke:#3aa8c1 - style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9 - style collector fill:#1a3320,color:#ccc,stroke:#5cb85c -``` - -**OTel Collector filelog receiver** parses rippled's log format, extracts `trace_id` and `span_id` as log attributes, and exports to Loki via OTLP. Grafana correlates Tempo traces with Loki logs via the shared `trace_id`. - ---- - -## Task 8.1: Inject trace_id into Logs::format() - -**Objective**: Add OTel trace context to every log line that is emitted within an active span. - -**What to do**: - -- Edit `src/libxrpl/basics/Log.cpp`: - - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span: - ```cpp - #ifdef XRPL_ENABLE_TELEMETRY - auto span = opentelemetry::trace::GetSpan( - opentelemetry::context::RuntimeContext::GetCurrent()); - auto ctx = span->GetContext(); - if (ctx.IsValid()) - { - // Append trace context as structured fields - char traceId[33], spanId[17]; - ctx.trace_id().ToLowerBase16(traceId); - ctx.span_id().ToLowerBase16(spanId); - output += "trace_id="; - output.append(traceId, 32); - output += " span_id="; - output.append(spanId, 16); - output += ' '; - } - #endif - ``` - - Add `#include` for OTel context headers, guarded by `#ifdef XRPL_ENABLE_TELEMETRY` - -- Edit `include/xrpl/basics/Log.h`: - - No changes needed — format() signature unchanged - -**Key modified files**: - -- `src/libxrpl/basics/Log.cpp` - -**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking — measured at <10ns per call. With ~1000 JLOG calls/min, this adds <10us/min of overhead. 
- ---- - -## Task 8.2: Add Loki to Docker Compose Stack - -**Objective**: Add Grafana Loki as a log storage backend in the development observability stack. - -**What to do**: - -- Edit `docker/telemetry/docker-compose.yml`: - - Add Loki service: - ```yaml - loki: - image: grafana/loki:2.9.0 - ports: - - "3100:3100" - command: -config.file=/etc/loki/local-config.yaml - ``` - - Add Loki as a Grafana data source in provisioning - -- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`: - - Configure Loki data source with derived fields linking `trace_id` to Tempo - -**Key new files**: - -- `docker/telemetry/grafana/provisioning/datasources/loki.yaml` - -**Key modified files**: - -- `docker/telemetry/docker-compose.yml` - ---- - -## Task 8.3: Add Filelog Receiver to OTel Collector - -**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki. - -**What to do**: - -- Edit `docker/telemetry/otel-collector-config.yaml`: - - Add `filelog` receiver: - ```yaml - receivers: - filelog: - include: [/var/log/rippled/debug.log] - operators: - - type: regex_parser - regex: '^(?P\S+)\s+(?P\S+):(?P\S+)\s+(?:trace_id=(?P[a-f0-9]+)\s+span_id=(?P[a-f0-9]+)\s+)?(?P.*)$' - timestamp: - parse_from: attributes.timestamp - layout: "%Y-%m-%dT%H:%M:%S.%fZ" - ``` - - Add logs pipeline: - ```yaml - service: - pipelines: - logs: - receivers: [filelog] - processors: [batch] - exporters: [otlp/loki] - ``` - - Add Loki exporter: - ```yaml - exporters: - otlp/loki: - endpoint: loki:3100 - tls: - insecure: true - ``` - -- Mount rippled's log directory into the collector container via docker-compose volume - -**Key modified files**: - -- `docker/telemetry/otel-collector-config.yaml` -- `docker/telemetry/docker-compose.yml` - ---- - -## Task 8.4: Configure Grafana Trace-to-Log Correlation - -**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana. 
- -**What to do**: - -- Edit Grafana Tempo data source provisioning to add `tracesToLogs` configuration: - - ```yaml - tracesToLogs: - datasourceUid: loki - filterByTraceID: true - filterBySpanID: false - tags: ["partition", "severity"] - ``` - -- Edit Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo: - ```yaml - derivedFields: - - datasourceUid: tempo - matcherRegex: "trace_id=(\\w+)" - name: TraceID - url: "$${__value.raw}" - ``` - -**Key modified files**: - -- `docker/telemetry/grafana/provisioning/datasources/loki.yaml` -- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file) - ---- - -## Task 8.5: Update Integration Tests - -**Objective**: Verify trace_id appears in logs and Loki correlation works. - -**What to do**: - -- Edit `docker/telemetry/integration-test.sh`: - - After sending RPC requests (which create spans), grep rippled's log output for `trace_id=` - - Verify trace_id matches a trace visible in Jaeger - - Optionally: query Loki via API to confirm log ingestion - -**Key modified files**: - -- `docker/telemetry/integration-test.sh` - ---- - -## Task 8.6: Update Documentation - -**Objective**: Document the log correlation feature in runbook and reference docs. - -**What to do**: - -- Edit `docs/telemetry-runbook.md`: - - Add "Log-Trace Correlation" section explaining how to use Grafana Tempo → Loki linking - - Add LogQL query examples for filtering by trace_id - -- Edit `OpenTelemetryPlan/09-data-collection-reference.md`: - - Add new section "3. 
Log Correlation" between SpanMetrics and StatsD sections - - Document the log format with trace_id injection - - Document Loki as a new backend - -- Edit `docker/telemetry/TESTING.md`: - - Add log correlation verification steps - -**Key modified files**: - -- `docs/telemetry-runbook.md` -- `OpenTelemetryPlan/09-data-collection-reference.md` -- `docker/telemetry/TESTING.md` - ---- - -## Summary Table - -| Task | Description | Sub-Phase | New Files | Modified Files | Effort | Risk | Depends On | -| ---- | ------------------------------------------ | --------- | --------- | -------------- | ------ | ------ | ---------- | -| 8.1 | Inject trace_id into Logs::format() | 8a | 0 | 1 | 1d | Low | Phase 7 | -| 8.2 | Add Loki to Docker Compose stack | 8b | 1 | 1 | 0.5d | Low | — | -| 8.3 | Add filelog receiver to OTel Collector | 8b | 0 | 2 | 1d | Medium | 8.1, 8.2 | -| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0 | 2 | 0.5d | Low | 8.3 | -| 8.5 | Update integration tests | 8a + 8b | 0 | 1 | 0.5d | Low | 8.4 | -| 8.6 | Update documentation | 8a + 8b | 0 | 3 | 1d | Low | 8.5 | - -**Total Effort**: 4.5 days - -**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential. - -**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)): - -- [ ] Log lines within active spans contain `trace_id= span_id=` -- [ ] Log lines outside spans have no trace context (no empty fields) -- [ ] Loki ingests rippled logs via OTel Collector filelog receiver -- [ ] Grafana Tempo → Loki one-click correlation works -- [ ] Grafana Loki → Tempo reverse lookup works via derived field -- [ ] Integration test verifies trace_id presence in logs -- [ ] No performance regression from trace_id injection (< 0.1% overhead)