From e39ae6fc1b98e8f61234bd84d2565a2c052d6620 Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Fri, 6 Mar 2026 16:01:38 +0000 Subject: [PATCH] Phase 8: Log-trace correlation plan docs and task list MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan (motivation, architecture, Mermaid diagrams, tasks table, exit criteria) - Add Phase8_taskList.md with per-task breakdown (8.1-8.6) - Add §5a log-trace correlation section to 09-data-collection-reference.md - Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases - Add Phases 6-8 to Gantt chart in 06-implementation-phases.md Co-Authored-By: Claude Opus 4.6 --- OpenTelemetryPlan/06-implementation-phases.md | 120 +++++++++ .../09-data-collection-reference.md | 32 +++ OpenTelemetryPlan/OpenTelemetryPlan.md | 5 +- OpenTelemetryPlan/Phase8_taskList.md | 234 ++++++++++++++++++ 4 files changed, 389 insertions(+), 2 deletions(-) create mode 100644 OpenTelemetryPlan/Phase8_taskList.md diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 2c82fce5a9..20be52be20 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -42,6 +42,15 @@ gantt section Phase 5 Documentation & Deploy :p5, after p4, 1w + + section Phase 6 + StatsD Metrics Bridge :p6, after p5, 1w + + section Phase 7 + Native OTel Metrics :p7, after p6, 2w + + section Phase 8 + Log-Trace Correlation :p8, after p7, 1w ``` --- @@ -448,6 +457,116 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. --- +## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) + +### Motivation + +rippled's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. 
When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver. + +#### Gains + +1. **One-click trace-to-log navigation** — Click a trace in Tempo/Jaeger and immediately see the corresponding log lines in Loki, filtered by `trace_id`. +2. **Reverse lookup (log-to-trace)** — Loki derived fields make `trace_id` values clickable links back to Tempo. +3. **Unified observability** — All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance. +4. **Zero new dependencies in rippled** — Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1. +5. **Negligible overhead** — `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min. + +#### Losses / Risks + +1. **Log format change** — Existing log parsers that rely on a fixed format will need updating to handle the optional `trace_id=... span_id=...` fields. +2. **Loki resource usage** — Log ingestion adds storage and memory overhead to the observability stack (mitigated by retention policies). +3. **Filelog receiver complexity** — The regex parser must be kept in sync with the log format; a format change in `Logs::format()` could break parsing. + +#### Decision + +The correlation value far outweighs the risks. The log format change is backward-compatible (fields are appended only when a span is active), and the filelog receiver regex is straightforward to maintain. 
+
+### Architecture
+
+Phase 8 has two independent sub-phases that can be developed in parallel:
+
+- **Phase 8a (code change)**: Modify `Logs::format()` in `src/libxrpl/basics/Log.cpp` to append `trace_id=<trace_id> span_id=<span_id>` when the current thread has an active OTel span. Guarded by `#ifdef XRPL_ENABLE_TELEMETRY`.
+- **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail rippled's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana Tempo↔Loki bidirectional linking.
+
+#### Trace ID Injection Flow
+
+```mermaid
+flowchart LR
+    subgraph rippled["rippled process"]
+        JLOG["JLOG(j.info())"]
+        Format["Logs::format()"]
+        OTelCtx["OTel Context<br/>(thread-local)"]
+        JLOG --> Format
+        OTelCtx -.->|"GetSpan()→GetContext()"| Format
+    end
+
+    subgraph output["Log Output"]
+        LogLine["2024-01-15T10:30:45.123Z<br/>LedgerMaster:NFO<br/>trace_id=abc123...<br/>span_id=def456...<br/>Validated ledger 42"]
+    end
+
+    Format --> LogLine
+
+    style rippled fill:#1a237e,stroke:#0d1642,color:#fff
+    style output fill:#1b5e20,stroke:#0d3d14,color:#fff
+    style JLOG fill:#283593,stroke:#1a237e,color:#fff
+    style Format fill:#283593,stroke:#1a237e,color:#fff
+    style OTelCtx fill:#283593,stroke:#1a237e,color:#fff
+    style LogLine fill:#2e7d32,stroke:#1b5e20,color:#fff
+```
+
+#### Loki Ingestion Pipeline
+
+```mermaid
+flowchart LR
+    subgraph collector["OTel Collector"]
+        FR["filelog receiver<br/>tails debug.log"]
+        RP["regex_parser<br/>extracts trace_id,<br/>span_id, severity"]
+        BP["batch processor"]
+        LE["otlp/loki exporter"]
+        FR --> RP --> BP --> LE
+    end
+
+    LogFile["rippled<br/>debug.log"] --> FR
+    LE --> Loki["Grafana Loki<br/>:3100"]
+    Loki <-->|"derivedFields ↔<br/>tracesToLogs"| Tempo["Grafana Tempo"]
+
+    style collector fill:#e65100,stroke:#bf360c,color:#fff
+    style FR fill:#f57c00,stroke:#e65100,color:#fff
+    style RP fill:#f57c00,stroke:#e65100,color:#fff
+    style BP fill:#f57c00,stroke:#e65100,color:#fff
+    style LE fill:#f57c00,stroke:#e65100,color:#fff
+    style LogFile fill:#1a237e,stroke:#0d1642,color:#fff
+    style Loki fill:#4a148c,stroke:#2e0d57,color:#fff
+    style Tempo fill:#4a148c,stroke:#2e0d57,color:#fff
+```
+
+### Tasks
+
+| Task | Description                                    | Effort | Risk   |
+| ---- | ---------------------------------------------- | ------ | ------ |
+| 8.1  | Inject trace_id into Logs::format()            | 1d     | Low    |
+| 8.2  | Add Loki to Docker Compose stack               | 0.5d   | Low    |
+| 8.3  | Add filelog receiver to OTel Collector         | 1d     | Medium |
+| 8.4  | Configure Grafana trace-to-log correlation     | 0.5d   | Low    |
+| 8.5  | Update integration tests                       | 0.5d   | Low    |
+| 8.6  | Update documentation (runbook, reference docs) | 1d     | Low    |
+
+**Total Effort**: 4.5 days
+
+**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential.
+
+### Exit Criteria
+
+- [ ] Log lines within active spans contain `trace_id=<trace_id> span_id=<span_id>`
+- [ ] Log lines outside spans have no trace context (no empty fields)
+- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
+- [ ] Grafana Tempo → Loki one-click correlation works
+- [ ] Grafana Loki → Tempo reverse lookup works via derived field
+- [ ] Integration test verifies trace_id presence in logs
+- [ ] No performance regression from trace_id injection (< 0.1% overhead)
+
+---
+
 ## 6.9 Risk Assessment
 
 ```mermaid
@@ -732,6 +851,7 @@ Clear, measurable criteria for each phase.
 | Phase 5 | Production deployment        | Operators trained        | End of Week 9  |
 | Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 |
 | Phase 7 | All metrics via OTLP         | No StatsD dependency     | End of Week 12 |
+| Phase 8 | trace_id in logs + Loki      | Tempo↔Loki correlation   | End of Week 13 |
 
 ---
 
diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md
index e671948945..bfa402ca6f 100644
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -494,6 +494,38 @@ rippled_State_Accounting_Full_duration
 
 ---
 
+## 5a. Future: Log-Trace Correlation (Phase 8)
+
+> **Plan details**: [06-implementation-phases.md §6.8.1](./06-implementation-phases.md) — motivation, architecture, Mermaid diagrams
+> **Task breakdown**: [Phase8_taskList.md](./Phase8_taskList.md) — per-task implementation details
+
+Phase 8 injects OTel trace context into rippled's `Logs::format()` output, enabling log-trace correlation:
+
+```
+2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456... span_id=0123456789abcdef Validated ledger 42
+```
+
+- **`trace_id=`** — Links to the distributed trace in Tempo/Jaeger
+- **`span_id=`** — Identifies the specific span within the trace
+- **Only present** when the log is emitted within an active OTel span
+- **Ingested** via OTel Collector filelog receiver → Grafana Loki
+- **Correlated** via Grafana derived fields (Loki↔Tempo bidirectional linking)
+
+### LogQL Query Examples
+
+```logql
+# Find all logs for a specific trace
+{job="rippled"} |= "trace_id=abc123def456"
+
+# Error logs with trace context
+{job="rippled"} | regexp `(?P<partition>\w+):(?P<severity>ERR|WRN)` | severity != ""
+
+# Logs from a specific partition with trace context
+{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>\w+)` | trace_id != ""
+```
+
+---
+
 ## 6. Known Issues
 
 | Issue | Impact | Status |
 
diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md
index 94d5cfec5b..f7f115cc06 100644
--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -157,7 +157,7 @@ OpenTelemetry Collector configurations are provided for development (with Jaeger
 
 ## 6. Implementation Phases
 
-The implementation spans 12 weeks across 7 phases:
+The implementation spans 13 weeks across 8 phases:
 
 | Phase | Duration | Focus | Key Deliverables |
 | ----- | ----------- | --------------------- | ----------------------------------------------------------- |
@@ -168,8 +168,9 @@ The implementation spans 13 weeks across 8 phases:
 | 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
 | 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards |
 | 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation |
+| 8 | Week 13 | Log-Trace Correlation | trace_id in logs, Loki ingestion, Tempo↔Loki linking |
 
-**Total Effort**: 60.6 developer-days with 2 developers
+**Total Effort**: 65.1 developer-days with 2 developers
 
 ➡️ **[View full Implementation Phases](./06-implementation-phases.md)**
 
diff --git a/OpenTelemetryPlan/Phase8_taskList.md b/OpenTelemetryPlan/Phase8_taskList.md
new file mode 100644
index 0000000000..32d928c9a9
--- /dev/null
+++ b/OpenTelemetryPlan/Phase8_taskList.md
@@ -0,0 +1,234 @@
+# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List
+
+> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add an OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
+>
+> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API.
+>
+> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`)
+
+### Related Plan Documents
+
+| Document                                                         | Relevance                                                      |
+| ---------------------------------------------------------------- | -------------------------------------------------------------- |
+| [06-implementation-phases.md](./06-implementation-phases.md)     | Phase 8 plan: motivation, architecture, exit criteria (§6.8.1) |
+| [07-observability-backends.md](./07-observability-backends.md)   | Loki backend recommendation, Grafana data source provisioning  |
+| [Phase7_taskList.md](./Phase7_taskList.md)                       | Prerequisite — native OTel metrics pipeline must be working    |
+| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle)               |
+
+---
+
+## Task 8.1: Inject trace_id into Logs::format()
+
+**Objective**: Add OTel trace context to every log line emitted within an active span.
+
+**What to do**:
+
+- Edit `src/libxrpl/basics/Log.cpp`:
+
+  - In `Logs::format()` (around line 346), after the severity is appended, check for an active OTel span:
+
+    ```cpp
+    #ifdef XRPL_ENABLE_TELEMETRY
+    auto span = opentelemetry::trace::GetSpan(
+        opentelemetry::context::RuntimeContext::GetCurrent());
+    auto ctx = span->GetContext();
+    if (ctx.IsValid())
+    {
+        // Append trace context as structured fields.
+        // ToLowerBase16 takes fixed-size buffers (32 and 16 chars,
+        // no null terminator) matching the hex-encoded id widths.
+        char traceId[32];
+        char spanId[16];
+        ctx.trace_id().ToLowerBase16(traceId);
+        ctx.span_id().ToLowerBase16(spanId);
+        output += "trace_id=";
+        output.append(traceId, 32);
+        output += " span_id=";
+        output.append(spanId, 16);
+        output += ' ';
+    }
+    #endif
+    ```
+
+  - Add the OTel context `#include` directives, guarded by `#ifdef XRPL_ENABLE_TELEMETRY`
+
+- `include/xrpl/basics/Log.h`: no changes needed — the `format()` signature is unchanged
+
+**Key modified files**:
+
+- `src/libxrpl/basics/Log.cpp`
+
+**Performance note**: `GetSpan()` and `GetContext()` are lock-free thread-local reads — measured at <10ns per call. At ~1000 JLOG calls/min, this adds <10us/min of overhead.
+
+---
+
+## Task 8.2: Add Loki to Docker Compose Stack
+
+**Objective**: Add Grafana Loki as a log storage backend in the development observability stack.
+
+**What to do**:
+
+- Edit `docker/telemetry/docker-compose.yml`:
+
+  - Add a Loki service:
+
+    ```yaml
+    loki:
+      image: grafana/loki:2.9.0
+      ports:
+        - "3100:3100"
+      command: -config.file=/etc/loki/local-config.yaml
+    ```
+
+  - Add Loki as a Grafana data source in provisioning
+
+- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`:
+  - Configure the Loki data source with derived fields linking `trace_id` to Tempo
+
+**Key new files**:
+
+- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
+
+**Key modified files**:
+
+- `docker/telemetry/docker-compose.yml`
+
+---
+
+## Task 8.3: Add Filelog Receiver to OTel Collector
+
+**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki.
+
+**What to do**:
+
+- Edit `docker/telemetry/otel-collector-config.yaml`:
+
+  - Add a `filelog` receiver:
+
+    ```yaml
+    receivers:
+      filelog:
+        include: [/var/log/rippled/debug.log]
+        operators:
+          - type: regex_parser
+            regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
+            timestamp:
+              parse_from: attributes.timestamp
+              layout: "%Y-%m-%dT%H:%M:%S.%fZ"
+    ```
+
+  - Add a logs pipeline:
+
+    ```yaml
+    service:
+      pipelines:
+        logs:
+          receivers: [filelog]
+          processors: [batch]
+          exporters: [otlp/loki]
+    ```
+
+  - Add the Loki exporter:
+
+    ```yaml
+    exporters:
+      otlp/loki:
+        endpoint: loki:3100
+        tls:
+          insecure: true
+    ```
+
+- Mount rippled's log directory into the collector container via a docker-compose volume
+
+**Key modified files**:
+
+- `docker/telemetry/otel-collector-config.yaml`
+- `docker/telemetry/docker-compose.yml`
+
+---
+
+## Task 8.4: Configure Grafana Trace-to-Log Correlation
+
+**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana.
+
+**What to do**:
+
+- Edit the Grafana Tempo data source provisioning to add a `tracesToLogs` configuration:
+
+  ```yaml
+  tracesToLogs:
+    datasourceUid: loki
+    filterByTraceID: true
+    filterBySpanID: false
+    tags: ["partition", "severity"]
+  ```
+
+- Edit the Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo:
+
+  ```yaml
+  derivedFields:
+    - datasourceUid: tempo
+      matcherRegex: "trace_id=(\\w+)"
+      name: TraceID
+      url: "$${__value.raw}"
+  ```
+
+**Key modified files**:
+
+- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
+- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file)
+
+---
+
+## Task 8.5: Update Integration Tests
+
+**Objective**: Verify that trace_id appears in logs and that Loki correlation works.
+
+**What to do**:
+
+- Edit `docker/telemetry/integration-test.sh`:
+  - After sending RPC requests (which create spans), grep rippled's log output for `trace_id=`
+  - Verify the trace_id matches a trace visible in Jaeger
+  - Optionally: query Loki via its API to confirm log ingestion
+
+**Key modified files**:
+
+- `docker/telemetry/integration-test.sh`
+
+---
+
+## Task 8.6: Update Documentation
+
+**Objective**: Document the log correlation feature in the runbook and reference docs.
+
+**What to do**:
+
+- Edit `docs/telemetry-runbook.md`:
+  - Add a "Log-Trace Correlation" section explaining how to use Grafana Tempo → Loki linking
+  - Add LogQL query examples for filtering by trace_id
+
+- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
+  - Update the "5a. Log-Trace Correlation" section (drop the "Future" label once the feature ships)
+  - Document the log format with trace_id injection
+  - Document Loki as a new backend
+
+- Edit `docker/telemetry/TESTING.md`:
+  - Add log correlation verification steps
+
+**Key modified files**:
+
+- `docs/telemetry-runbook.md`
+- `OpenTelemetryPlan/09-data-collection-reference.md`
+- `docker/telemetry/TESTING.md`
+
+---
+
+## Summary Table
+
+| Task | Description                                | Sub-Phase | New Files | Modified Files | Effort | Risk   | Depends On |
+| ---- | ------------------------------------------ | --------- | --------- | -------------- | ------ | ------ | ---------- |
+| 8.1  | Inject trace_id into Logs::format()        | 8a        | 0         | 1              | 1d     | Low    | Phase 7    |
+| 8.2  | Add Loki to Docker Compose stack           | 8b        | 1         | 1              | 0.5d   | Low    | --         |
+| 8.3  | Add filelog receiver to OTel Collector     | 8b        | 0         | 2              | 1d     | Medium | 8.1, 8.2   |
+| 8.4  | Configure Grafana trace-to-log correlation | 8b        | 0         | 2              | 0.5d   | Low    | 8.3        |
+| 8.5  | Update integration tests                   | 8a + 8b   | 0         | 1              | 0.5d   | Low    | 8.4        |
+| 8.6  | Update documentation                       | 8a + 8b   | 0         | 3              | 1d     | Low    | 8.5        |
+
+**Total Effort**: 4.5 days
+
+**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential.
+
+**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)):
+
+- [ ] Log lines within active spans contain `trace_id=<trace_id> span_id=<span_id>`
+- [ ] Log lines outside spans have no trace context (no empty fields)
+- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
+- [ ] Grafana Tempo → Loki one-click correlation works
+- [ ] Grafana Loki → Tempo reverse lookup works via derived field
+- [ ] Integration test verifies trace_id presence in logs
+- [ ] No performance regression from trace_id injection (< 0.1% overhead)
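As an offline sanity check for the Task 8.3 `regex_parser` pattern, the sketch below exercises it against the two log shapes Phase 8 produces — with and without trace context (Python's `re` accepts the same `(?P<name>…)` named-group syntax as the collector's parser, though the two engines differ on some constructs; the sample field values are placeholders):

```python
import re

# The filelog regex from Task 8.3; the group names must match the fields
# the Loki pipeline and Grafana derived fields expect.
LOG_RE = re.compile(
    r'^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+'
    r'(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?'
    r'(?P<message>.*)$'
)

with_ctx = ("2024-01-15T10:30:45.123Z LedgerMaster:NFO "
            "trace_id=abc123def456 span_id=0123456789abcdef Validated ledger 42")
without_ctx = "2024-01-15T10:30:45.123Z LedgerMaster:NFO Validated ledger 42"

m = LOG_RE.match(with_ctx)
print(m.group("partition"), m.group("severity"), m.group("trace_id"))

# Outside a span the optional group simply fails to match: trace_id is
# None and the message is still captured intact.
m2 = LOG_RE.match(without_ctx)
print(m2.group("trace_id"), m2.group("message"))
```

Running a check like this in CI whenever `Logs::format()` changes would catch the Phase 8 risk of the collector regex drifting out of sync with the log format.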