mirror of
https://github.com/XRPLF/rippled.git
synced 2026-03-17 02:02:33 +00:00
Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan (motivation, architecture, Mermaid diagrams, tasks table, exit criteria) - Add Phase8_taskList.md with per-task breakdown (8.1-8.6) - Add §5a log-trace correlation section to 09-data-collection-reference.md - Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases - Add Phases 6-8 to Gantt chart in 06-implementation-phases.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -42,6 +42,15 @@ gantt
|
||||
|
||||
section Phase 5
|
||||
Documentation & Deploy :p5, after p4, 1w
|
||||
|
||||
section Phase 6
|
||||
StatsD Metrics Bridge :p6, after p5, 1w
|
||||
|
||||
section Phase 7
|
||||
Native OTel Metrics :p7, after p6, 2w
|
||||
|
||||
section Phase 8
|
||||
Log-Trace Correlation :p8, after p7, 1w
|
||||
```
|
||||
|
||||
---
|
||||
@@ -448,6 +457,116 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown.
|
||||
|
||||
---
|
||||
|
||||
## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13)
|
||||
|
||||
### Motivation
|
||||
|
||||
rippled's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.
|
||||
|
||||
#### Gains
|
||||
|
||||
1. **One-click trace-to-log navigation** — Click a trace in Tempo/Jaeger and immediately see the corresponding log lines in Loki, filtered by `trace_id`.
|
||||
2. **Reverse lookup (log-to-trace)** — Loki derived fields make `trace_id` values clickable links back to Tempo.
|
||||
3. **Unified observability** — All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance.
|
||||
4. **Zero new dependencies in rippled** — Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1.
|
||||
5. **Negligible overhead** — `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min.
|
||||
|
||||
#### Losses / Risks
|
||||
|
||||
1. **Log format change** — Existing log parsers that rely on a fixed format will need updating to handle the optional `trace_id=... span_id=...` fields.
|
||||
2. **Loki resource usage** — Log ingestion adds storage and memory overhead to the observability stack (mitigated by retention policies).
|
||||
3. **Filelog receiver complexity** — The regex parser must be kept in sync with the log format; a format change in `Logs::format()` could break parsing.
|
||||
|
||||
#### Decision
|
||||
|
||||
The correlation value far outweighs the risks. The log format change is backward-compatible (fields are appended only when a span is active), and the filelog receiver regex is straightforward to maintain.
|
||||
|
||||
### Architecture
|
||||
|
||||
Phase 8 has two independent sub-phases that can be developed in parallel:
|
||||
|
||||
- **Phase 8a (code change)**: Modify `Logs::format()` in `src/libxrpl/basics/Log.cpp` to append `trace_id=<hex32> span_id=<hex16>` when the current thread has an active OTel span. Guarded by `#ifdef XRPL_ENABLE_TELEMETRY`.
|
||||
- **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail rippled's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana Tempo↔Loki bidirectional linking.
|
||||
|
||||
#### Trace ID Injection Flow
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph rippled["rippled process"]
|
||||
JLOG["JLOG(j.info())"]
|
||||
Format["Logs::format()"]
|
||||
OTelCtx["OTel Context<br/>(thread-local)"]
|
||||
JLOG --> Format
|
||||
OTelCtx -.->|"GetSpan()→GetContext()"| Format
|
||||
end
|
||||
|
||||
subgraph output["Log Output"]
|
||||
LogLine["2024-01-15T10:30:45.123Z<br/>LedgerMaster:NFO<br/>trace_id=abc123...<br/>span_id=def456...<br/>Validated ledger 42"]
|
||||
end
|
||||
|
||||
Format --> LogLine
|
||||
|
||||
style rippled fill:#1a237e,stroke:#0d1642,color:#fff
|
||||
style output fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style JLOG fill:#283593,stroke:#1a237e,color:#fff
|
||||
style Format fill:#283593,stroke:#1a237e,color:#fff
|
||||
style OTelCtx fill:#283593,stroke:#1a237e,color:#fff
|
||||
style LogLine fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
```
|
||||
|
||||
#### Loki Ingestion Pipeline
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph collector["OTel Collector"]
|
||||
FR["filelog receiver<br/>tails debug.log"]
|
||||
RP["regex_parser<br/>extracts trace_id,<br/>span_id, severity"]
|
||||
BP["batch processor"]
|
||||
LE["otlp/loki exporter"]
|
||||
FR --> RP --> BP --> LE
|
||||
end
|
||||
|
||||
LogFile["rippled<br/>debug.log"] --> FR
|
||||
LE --> Loki["Grafana Loki<br/>:3100"]
|
||||
Loki <-->|"derivedFields ↔<br/>tracesToLogs"| Tempo["Grafana Tempo"]
|
||||
|
||||
style collector fill:#e65100,stroke:#bf360c,color:#fff
|
||||
style FR fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style RP fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style BP fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style LE fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style LogFile fill:#1a237e,stroke:#0d1642,color:#fff
|
||||
style Loki fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style Tempo fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | ---------------------------------------------- | ------ | ------ |
|
||||
| 8.1 | Inject trace_id into Logs::format() | 1d | Low |
|
||||
| 8.2 | Add Loki to Docker Compose stack | 0.5d | Low |
|
||||
| 8.3 | Add filelog receiver to OTel Collector | 1d | Medium |
|
||||
| 8.4 | Configure Grafana trace-to-log correlation | 0.5d | Low |
|
||||
| 8.5 | Update integration tests | 0.5d | Low |
|
||||
| 8.6 | Update documentation (runbook, reference docs) | 1d | Low |
|
||||
|
||||
**Total Effort**: 4.5 days
|
||||
|
||||
**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential.
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
|
||||
- [ ] Log lines outside spans have no trace context (no empty fields)
|
||||
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
|
||||
- [ ] Grafana Tempo → Loki one-click correlation works
|
||||
- [ ] Grafana Loki → Tempo reverse lookup works via derived field
|
||||
- [ ] Integration test verifies trace_id presence in logs
|
||||
- [ ] No performance regression from trace_id injection (< 0.1% overhead)
|
||||
|
||||
---
|
||||
|
||||
## 6.9 Risk Assessment
|
||||
|
||||
```mermaid
|
||||
@@ -732,6 +851,7 @@ Clear, measurable criteria for each phase.
|
||||
| Phase 5 | Production deployment | Operators trained | End of Week 9 |
|
||||
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 |
|
||||
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 |
|
||||
| Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 |
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -494,6 +494,38 @@ rippled_State_Accounting_Full_duration
|
||||
|
||||
---
|
||||
|
||||
## 5a. Future: Log-Trace Correlation (Phase 8)
|
||||
|
||||
> **Plan details**: [06-implementation-phases.md §6.8.1](./06-implementation-phases.md) — motivation, architecture, Mermaid diagrams
|
||||
> **Task breakdown**: [Phase8_taskList.md](./Phase8_taskList.md) — per-task implementation details
|
||||
|
||||
Phase 8 injects OTel trace context into rippled's `Logs::format()` output, enabling log-trace correlation:
|
||||
|
||||
```
|
||||
2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456... span_id=0123456789abcdef Validated ledger 42
|
||||
```
|
||||
|
||||
- **`trace_id=<hex32>`** — Links to the distributed trace in Tempo/Jaeger
|
||||
- **`span_id=<hex16>`** — Identifies the specific span within the trace
|
||||
- **Only present** when the log is emitted within an active OTel span
|
||||
- **Ingested** via OTel Collector filelog receiver → Grafana Loki
|
||||
- **Correlated** via Grafana derived fields (Loki↔Tempo bidirectional linking)
|
||||
|
||||
### LogQL Query Examples
|
||||
|
||||
```promql
|
||||
# Find all logs for a specific trace
|
||||
{job="rippled"} |= "trace_id=abc123def456"
|
||||
|
||||
# Error logs with trace context
|
||||
{job="rippled"} | regexp `(?P<severity>\w+):(?P<level>ERR|WRN)` | severity != ""
|
||||
|
||||
# Logs from a specific partition with trace context
|
||||
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>\w+)` | trace_id != ""
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Known Issues
|
||||
|
||||
| Issue | Impact | Status |
|
||||
|
||||
@@ -157,7 +157,7 @@ OpenTelemetry Collector configurations are provided for development (with Jaeger
|
||||
|
||||
## 6. Implementation Phases
|
||||
|
||||
The implementation spans 12 weeks across 7 phases:
|
||||
The implementation spans 13 weeks across 8 phases:
|
||||
|
||||
| Phase | Duration | Focus | Key Deliverables |
|
||||
| ----- | ----------- | --------------------- | ----------------------------------------------------------- |
|
||||
@@ -168,8 +168,9 @@ The implementation spans 12 weeks across 7 phases:
|
||||
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
|
||||
| 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards |
|
||||
| 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation |
|
||||
| 8 | Week 13 | Log-Trace Correlation | trace_id in logs, Loki ingestion, Tempo↔Loki linking |
|
||||
|
||||
**Total Effort**: 60.6 developer-days with 2 developers
|
||||
**Total Effort**: 65.1 developer-days with 2 developers
|
||||
|
||||
➡️ **[View full Implementation Phases](./06-implementation-phases.md)**
|
||||
|
||||
|
||||
234
OpenTelemetryPlan/Phase8_taskList.md
Normal file
234
OpenTelemetryPlan/Phase8_taskList.md
Normal file
@@ -0,0 +1,234 @@
|
||||
# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List
|
||||
|
||||
> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
|
||||
>
|
||||
> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API.
|
||||
>
|
||||
> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`)
|
||||
|
||||
### Related Plan Documents
|
||||
|
||||
| Document | Relevance |
|
||||
| ---------------------------------------------------------------- | -------------------------------------------------------------- |
|
||||
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 8 plan: motivation, architecture, exit criteria (§6.8.1) |
|
||||
| [07-observability-backends.md](./07-observability-backends.md) | Loki backend recommendation, Grafana data source provisioning |
|
||||
| [Phase7_taskList.md](./Phase7_taskList.md) | Prerequisite — native OTel metrics pipeline must be working |
|
||||
| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle) |
|
||||
|
||||
---
|
||||
|
||||
## Task 8.1: Inject trace_id into Logs::format()
|
||||
|
||||
**Objective**: Add OTel trace context to every log line that is emitted within an active span.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `src/libxrpl/basics/Log.cpp`:
|
||||
- In `Logs::format()` (around line 346), after severity is appended, check for active OTel span:
|
||||
```cpp
|
||||
#ifdef XRPL_ENABLE_TELEMETRY
|
||||
auto span = opentelemetry::trace::GetSpan(
|
||||
opentelemetry::context::RuntimeContext::GetCurrent());
|
||||
auto ctx = span->GetContext();
|
||||
if (ctx.IsValid())
|
||||
{
|
||||
// Append trace context as structured fields
|
||||
char traceId[33], spanId[17];
|
||||
ctx.trace_id().ToLowerBase16(traceId);
|
||||
ctx.span_id().ToLowerBase16(spanId);
|
||||
output += "trace_id=";
|
||||
output.append(traceId, 32);
|
||||
output += " span_id=";
|
||||
output.append(spanId, 16);
|
||||
output += ' ';
|
||||
}
|
||||
#endif
|
||||
```
|
||||
- Add `#include` for OTel context headers, guarded by `#ifdef XRPL_ENABLE_TELEMETRY`
|
||||
|
||||
- Edit `include/xrpl/basics/Log.h`:
|
||||
- No changes needed — format() signature unchanged
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `src/libxrpl/basics/Log.cpp`
|
||||
|
||||
**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking — measured at <10ns per call. With ~1000 JLOG calls/min, this adds <10us/min of overhead.
|
||||
|
||||
---
|
||||
|
||||
## Task 8.2: Add Loki to Docker Compose Stack
|
||||
|
||||
**Objective**: Add Grafana Loki as a log storage backend in the development observability stack.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/docker-compose.yml`:
|
||||
- Add Loki service:
|
||||
```yaml
|
||||
loki:
|
||||
image: grafana/loki:2.9.0
|
||||
ports:
|
||||
- "3100:3100"
|
||||
command: -config.file=/etc/loki/local-config.yaml
|
||||
```
|
||||
- Add Loki as a Grafana data source in provisioning
|
||||
|
||||
- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`:
|
||||
- Configure Loki data source with derived fields linking `trace_id` to Tempo
|
||||
|
||||
**Key new files**:
|
||||
|
||||
- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/docker-compose.yml`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.3: Add Filelog Receiver to OTel Collector
|
||||
|
||||
**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/otel-collector-config.yaml`:
|
||||
- Add `filelog` receiver:
|
||||
```yaml
|
||||
receivers:
|
||||
filelog:
|
||||
include: [/var/log/rippled/debug.log]
|
||||
operators:
|
||||
- type: regex_parser
|
||||
regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
|
||||
timestamp:
|
||||
parse_from: attributes.timestamp
|
||||
layout: "%Y-%m-%dT%H:%M:%S.%fZ"
|
||||
```
|
||||
- Add logs pipeline:
|
||||
```yaml
|
||||
service:
|
||||
pipelines:
|
||||
logs:
|
||||
receivers: [filelog]
|
||||
processors: [batch]
|
||||
exporters: [otlp/loki]
|
||||
```
|
||||
- Add Loki exporter:
|
||||
```yaml
|
||||
exporters:
|
||||
otlp/loki:
|
||||
endpoint: loki:3100
|
||||
tls:
|
||||
insecure: true
|
||||
```
|
||||
|
||||
- Mount rippled's log directory into the collector container via docker-compose volume
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/otel-collector-config.yaml`
|
||||
- `docker/telemetry/docker-compose.yml`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.4: Configure Grafana Trace-to-Log Correlation
|
||||
|
||||
**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit Grafana Tempo data source provisioning to add `tracesToLogs` configuration:
|
||||
|
||||
```yaml
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
filterByTraceID: true
|
||||
filterBySpanID: false
|
||||
tags: ["partition", "severity"]
|
||||
```
|
||||
|
||||
- Edit Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo:
|
||||
```yaml
|
||||
derivedFields:
|
||||
- datasourceUid: tempo
|
||||
matcherRegex: "trace_id=(\\w+)"
|
||||
name: TraceID
|
||||
url: "$${__value.raw}"
|
||||
```
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
|
||||
- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file)
|
||||
|
||||
---
|
||||
|
||||
## Task 8.5: Update Integration Tests
|
||||
|
||||
**Objective**: Verify trace_id appears in logs and Loki correlation works.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/integration-test.sh`:
|
||||
- After sending RPC requests (which create spans), grep rippled's log output for `trace_id=`
|
||||
- Verify trace_id matches a trace visible in Jaeger
|
||||
- Optionally: query Loki via API to confirm log ingestion
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/integration-test.sh`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.6: Update Documentation
|
||||
|
||||
**Objective**: Document the log correlation feature in runbook and reference docs.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docs/telemetry-runbook.md`:
|
||||
- Add "Log-Trace Correlation" section explaining how to use Grafana Tempo -> Loki linking
|
||||
- Add LogQL query examples for filtering by trace_id
|
||||
|
||||
- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
|
||||
- Add new section "3. Log Correlation" between SpanMetrics and StatsD sections
|
||||
- Document the log format with trace_id injection
|
||||
- Document Loki as a new backend
|
||||
|
||||
- Edit `docker/telemetry/TESTING.md`:
|
||||
- Add log correlation verification steps
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docs/telemetry-runbook.md`
|
||||
- `OpenTelemetryPlan/09-data-collection-reference.md`
|
||||
- `docker/telemetry/TESTING.md`
|
||||
|
||||
---
|
||||
|
||||
## Summary Table
|
||||
|
||||
| Task | Description | Sub-Phase | New Files | Modified Files | Effort | Risk | Depends On |
|
||||
| ---- | ------------------------------------------ | --------- | --------- | -------------- | ------ | ------ | ---------- |
|
||||
| 8.1 | Inject trace_id into Logs::format() | 8a | 0 | 1 | 1d | Low | Phase 7 |
|
||||
| 8.2 | Add Loki to Docker Compose stack | 8b | 1 | 1 | 0.5d | Low | -- |
|
||||
| 8.3 | Add filelog receiver to OTel Collector | 8b | 0 | 2 | 1d | Medium | 8.1, 8.2 |
|
||||
| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0 | 2 | 0.5d | Low | 8.3 |
|
||||
| 8.5 | Update integration tests | 8a + 8b | 0 | 1 | 0.5d | Low | 8.4 |
|
||||
| 8.6 | Update documentation | 8a + 8b | 0 | 3 | 1d | Low | 8.5 |
|
||||
|
||||
**Total Effort**: 4.5 days
|
||||
|
||||
**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential.
|
||||
|
||||
**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)):
|
||||
|
||||
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
|
||||
- [ ] Log lines outside spans have no trace context (no empty fields)
|
||||
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
|
||||
- [ ] Grafana Tempo -> Loki one-click correlation works
|
||||
- [ ] Grafana Loki -> Tempo reverse lookup works via derived field
|
||||
- [ ] Integration test verifies trace_id presence in logs
|
||||
- [ ] No performance regression from trace_id injection (< 0.1% overhead)
|
||||
Reference in New Issue
Block a user