mirror of
https://github.com/XRPLF/rippled.git
synced 2026-04-29 15:37:57 +00:00
Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -54,6 +54,15 @@ gantt
|
||||
|
||||
section Phase 5
|
||||
Documentation & Deploy :p5, after p4, 1w
|
||||
|
||||
section Phase 6
|
||||
StatsD Metrics Bridge :p6, after p5, 1w
|
||||
|
||||
section Phase 7
|
||||
Native OTel Metrics :p7, after p6, 2w
|
||||
|
||||
section Phase 8
|
||||
Log-Trace Correlation :p8, after p7, 1w
|
||||
```
|
||||
|
||||
---
|
||||
@@ -224,6 +233,16 @@ gantt
|
||||
| `consensus.accept.apply` | `RCLConsensus.cpp:453` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq` |
|
||||
| `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` |
|
||||
|
||||
### Spans Produced
|
||||
|
||||
| Span Name | Location | Attributes |
|
||||
| --------------------------- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `consensus.proposal.send` | `RCLConsensus.cpp:177` | `xrpl.consensus.round` |
|
||||
| `consensus.ledger_close` | `RCLConsensus.cpp:282` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
|
||||
| `consensus.accept` | `RCLConsensus.cpp:395` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` |
|
||||
| `consensus.accept.apply` | `RCLConsensus.cpp:453` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq` |
|
||||
| `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` |
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [x] Complete consensus round traces
|
||||
@@ -589,6 +608,114 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown.
|
||||
|
||||
---
|
||||
|
||||
## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13)
|
||||
|
||||
### Motivation
|
||||
|
||||
rippled's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.
|
||||
|
||||
#### Gains
|
||||
|
||||
1. **One-click trace-to-log navigation** — Click a trace in Tempo/Jaeger and immediately see the corresponding log lines in Loki, filtered by `trace_id`.
|
||||
2. **Reverse lookup (log-to-trace)** — Loki derived fields make `trace_id` values clickable links back to Tempo.
|
||||
3. **Unified observability** — All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance.
|
||||
4. **Zero new dependencies in rippled** — Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1.
|
||||
5. **Negligible overhead** — `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min.
|
||||
|
||||
#### Losses / Risks
|
||||
|
||||
1. **Log format change** — Existing log parsers that rely on a fixed format will need updating to handle the optional `trace_id=... span_id=...` fields.
|
||||
2. **Loki resource usage** — Log ingestion adds storage and memory overhead to the observability stack (mitigated by retention policies).
|
||||
3. **Filelog receiver complexity** — The regex parser must be kept in sync with the log format; a format change in `Logs::format()` could break parsing.
|
||||
|
||||
#### Decision
|
||||
|
||||
The correlation value far outweighs the risks. The log format change is backward-compatible (fields are appended only when a span is active), and the filelog receiver regex is straightforward to maintain.
|
||||
|
||||
### Architecture
|
||||
|
||||
Phase 8 has two independent sub-phases that can be developed in parallel:
|
||||
|
||||
- **Phase 8a (code change)**: Modify `Logs::format()` in `src/libxrpl/basics/Log.cpp` to append `trace_id=<hex32> span_id=<hex16>` when the current thread has an active OTel span. Guarded by `#ifdef XRPL_ENABLE_TELEMETRY`.
|
||||
- **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail rippled's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana Tempo↔Loki bidirectional linking.
|
||||
|
||||
#### Trace ID Injection Flow
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph rippled["rippled process"]
|
||||
JLOG["JLOG(j.info())"]
|
||||
Format["Logs::format()"]
|
||||
OTelCtx["OTel Context<br/>(thread-local)"]
|
||||
JLOG --> Format
|
||||
OTelCtx -.->|"GetSpan()→GetContext()"| Format
|
||||
end
|
||||
|
||||
subgraph output["Log Output"]
|
||||
LogLine["2024-01-15T10:30:45.123Z<br/>LedgerMaster:NFO<br/>trace_id=abc123...<br/>span_id=def456...<br/>Validated ledger 42"]
|
||||
end
|
||||
|
||||
Format --> LogLine
|
||||
|
||||
style rippled fill:#1a237e,stroke:#0d1642,color:#fff
|
||||
style output fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style JLOG fill:#283593,stroke:#1a237e,color:#fff
|
||||
style Format fill:#283593,stroke:#1a237e,color:#fff
|
||||
style OTelCtx fill:#283593,stroke:#1a237e,color:#fff
|
||||
style LogLine fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
```
|
||||
|
||||
#### Loki Ingestion Pipeline
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph collector["OTel Collector"]
|
||||
FR["filelog receiver<br/>tails debug.log"]
|
||||
RP["regex_parser<br/>extracts trace_id,<br/>span_id, severity"]
|
||||
BP["batch processor"]
|
||||
LE["otlp/loki exporter"]
|
||||
FR --> RP --> BP --> LE
|
||||
end
|
||||
|
||||
LogFile["rippled<br/>debug.log"] --> FR
|
||||
LE --> Loki["Grafana Loki<br/>:3100"]
|
||||
Loki <-->|"derivedFields ↔<br/>tracesToLogs"| Tempo["Grafana Tempo"]
|
||||
|
||||
style collector fill:#e65100,stroke:#bf360c,color:#fff
|
||||
style FR fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style RP fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style BP fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style LE fill:#f57c00,stroke:#e65100,color:#fff
|
||||
style LogFile fill:#1a237e,stroke:#0d1642,color:#fff
|
||||
style Loki fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style Tempo fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description |
|
||||
| ---- | ---------------------------------------------- |
|
||||
| 8.1 | Inject trace_id into Logs::format() |
|
||||
| 8.2 | Add Loki to Docker Compose stack |
|
||||
| 8.3 | Add filelog receiver to OTel Collector |
|
||||
| 8.4 | Configure Grafana trace-to-log correlation |
|
||||
| 8.5 | Update integration tests |
|
||||
| 8.6 | Update documentation (runbook, reference docs) |
|
||||
|
||||
**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential.
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
|
||||
- [ ] Log lines outside spans have no trace context (no empty fields)
|
||||
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
|
||||
- [ ] Grafana Tempo → Loki one-click correlation works
|
||||
- [ ] Grafana Loki → Tempo reverse lookup works via derived field
|
||||
- [ ] Integration test verifies trace_id presence in logs
|
||||
- [ ] No performance regression from trace_id injection (< 0.1% overhead)
|
||||
|
||||
---
|
||||
|
||||
## 6.9 Risk Assessment
|
||||
|
||||
```mermaid
|
||||
@@ -875,6 +1002,7 @@ Clear, measurable criteria for each phase.
|
||||
| Phase 5 | Production deployment | Operators trained | End of Week 9 |
|
||||
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 |
|
||||
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 |
|
||||
| Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 |
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -196,6 +196,7 @@ flowchart TB
|
||||
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
|
||||
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
|
||||
| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration |
|
||||
| [Phase8_taskList.md](./Phase8_taskList.md) | Log-trace correlation |
|
||||
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
|
||||
|
||||
---
|
||||
|
||||
@@ -494,6 +494,79 @@ rippled_State_Accounting_Full_duration
|
||||
|
||||
---
|
||||
|
||||
## 5a. Log-Trace Correlation (Phase 8)
|
||||
|
||||
> **Plan details**: [06-implementation-phases.md §6.8.1](./06-implementation-phases.md) — motivation, architecture, Mermaid diagrams
|
||||
> **Task breakdown**: [Phase8_taskList.md](./Phase8_taskList.md) — per-task implementation details
|
||||
|
||||
Phase 8 injects OTel trace context into rippled's `Logs::format()` output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:
|
||||
|
||||
### Log Format
|
||||
|
||||
```
|
||||
<timestamp> <partition>:<severity> trace_id=<32hex> span_id=<16hex> <message>
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```
|
||||
2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42
|
||||
```
|
||||
|
||||
- **`trace_id=<hex32>`** — 32-character lowercase hex trace identifier. Links to the distributed trace in Tempo/Jaeger.
|
||||
- **`span_id=<hex16>`** — 16-character lowercase hex span identifier. Identifies the specific span within the trace.
|
||||
- **Only present** when the log is emitted within an active OTel span. Log lines outside of traced code paths have no trace context fields.
|
||||
|
||||
### Implementation
|
||||
|
||||
The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It reads the current span from OTel's thread-local runtime context via `opentelemetry::trace::GetSpan()` and `opentelemetry::context::RuntimeContext::GetCurrent()`. Both calls are lock-free thread-local reads measured at <10ns per call.
|
||||
|
||||
### Log Ingestion Pipeline
|
||||
|
||||
```
|
||||
rippled debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki
|
||||
```
|
||||
|
||||
The OTel Collector's `filelog` receiver tails `debug.log` files and uses a `regex_parser` operator to extract structured fields:
|
||||
|
||||
| Field | Type | Description |
|
||||
| ----------- | -------- | -------------------------------------------------------- |
|
||||
| `timestamp` | datetime | Log timestamp |
|
||||
| `partition` | string | Log partition (e.g., `LedgerMaster`, `PeerImp`) |
|
||||
| `severity` | string | Severity code (`TRC`, `DBG`, `NFO`, `WRN`, `ERR`, `FTL`) |
|
||||
| `trace_id` | string | 32-hex trace identifier (optional) |
|
||||
| `span_id` | string | 16-hex span identifier (optional) |
|
||||
| `message` | string | Log message body |
|
||||
|
||||
### Grafana Correlation
|
||||
|
||||
Bidirectional linking between logs and traces is configured via Grafana datasource provisioning:
|
||||
|
||||
- **Tempo -> Loki** (`tracesToLogs`): Clicking "Logs for this trace" on a Tempo trace view filters Loki logs by `trace_id`, showing all log lines from that trace.
|
||||
- **Loki -> Tempo** (`derivedFields`): A regex-based derived field on the Loki datasource extracts `trace_id` from log lines and renders it as a clickable link to the corresponding trace in Tempo.
|
||||
|
||||
### Loki Backend
|
||||
|
||||
Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries from the OTel Collector's `loki` exporter via the push API at `http://loki:3100/loki/api/v1/push`.
|
||||
|
||||
### LogQL Query Examples
|
||||
|
||||
```logql
|
||||
# Find all logs for a specific trace
|
||||
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"
|
||||
|
||||
# Error logs with trace context
|
||||
{job="rippled"} |= "ERR" |= "trace_id="
|
||||
|
||||
# Logs from a specific partition with trace context
|
||||
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
|
||||
|
||||
# Count traced log lines over time
|
||||
count_over_time({job="rippled"} |= "trace_id=" [5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Known Issues
|
||||
|
||||
| Issue | Impact | Status |
|
||||
|
||||
@@ -187,7 +187,7 @@ OpenTelemetry Collector configurations are provided for development and producti
|
||||
|
||||
## 6. Implementation Phases
|
||||
|
||||
The implementation spans 12 weeks across 7 phases:
|
||||
The implementation spans 13 weeks across 8 phases:
|
||||
|
||||
| Phase | Duration | Focus | Key Deliverables |
|
||||
| ----- | ----------- | --------------------- | ----------------------------------------------------------- |
|
||||
@@ -198,8 +198,9 @@ The implementation spans 12 weeks across 7 phases:
|
||||
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
|
||||
| 6 | Week 10 | StatsD Metrics Bridge | OTel Collector StatsD receiver, 3 Grafana dashboards |
|
||||
| 7 | Weeks 11-12 | Native OTel Metrics | OTelCollector impl, OTLP metrics export, StatsD deprecation |
|
||||
| 8 | Week 13 | Log-Trace Correlation | trace_id in logs, Loki ingestion, Tempo↔Loki linking |
|
||||
|
||||
**Total Effort**: 60.6 developer-days with 2 developers
|
||||
**Total Effort**: 65.1 developer-days with 2 developers
|
||||
|
||||
➡️ **[View full Implementation Phases](./06-implementation-phases.md)**
|
||||
|
||||
|
||||
232
OpenTelemetryPlan/Phase8_taskList.md
Normal file
232
OpenTelemetryPlan/Phase8_taskList.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List
|
||||
|
||||
> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
|
||||
>
|
||||
> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API.
|
||||
>
|
||||
> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`)
|
||||
|
||||
### Related Plan Documents
|
||||
|
||||
| Document | Relevance |
|
||||
| ---------------------------------------------------------------- | -------------------------------------------------------------- |
|
||||
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 8 plan: motivation, architecture, exit criteria (§6.8.1) |
|
||||
| [07-observability-backends.md](./07-observability-backends.md) | Loki backend recommendation, Grafana data source provisioning |
|
||||
| [Phase7_taskList.md](./Phase7_taskList.md) | Prerequisite — native OTel metrics pipeline must be working |
|
||||
| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle) |
|
||||
|
||||
---
|
||||
|
||||
## Task 8.1: Inject trace_id into Logs::format()
|
||||
|
||||
**Objective**: Add OTel trace context to every log line that is emitted within an active span.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `src/libxrpl/basics/Log.cpp`:
|
||||
- In `Logs::format()` (around line 346), after severity is appended, check for active OTel span:
|
||||
```cpp
|
||||
#ifdef XRPL_ENABLE_TELEMETRY
|
||||
auto span = opentelemetry::trace::GetSpan(
|
||||
opentelemetry::context::RuntimeContext::GetCurrent());
|
||||
auto ctx = span->GetContext();
|
||||
if (ctx.IsValid())
|
||||
{
|
||||
// Append trace context as structured fields
|
||||
char traceId[33], spanId[17];
|
||||
ctx.trace_id().ToLowerBase16(traceId);
|
||||
ctx.span_id().ToLowerBase16(spanId);
|
||||
output += "trace_id=";
|
||||
output.append(traceId, 32);
|
||||
output += " span_id=";
|
||||
output.append(spanId, 16);
|
||||
output += ' ';
|
||||
}
|
||||
#endif
|
||||
```
|
||||
- Add `#include` for OTel context headers, guarded by `#ifdef XRPL_ENABLE_TELEMETRY`
|
||||
|
||||
- Edit `include/xrpl/basics/Log.h`:
|
||||
- No changes needed — format() signature unchanged
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `src/libxrpl/basics/Log.cpp`
|
||||
|
||||
**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking — measured at <10ns per call. With ~1000 JLOG calls/min, this adds <10us/min of overhead.
|
||||
|
||||
---
|
||||
|
||||
## Task 8.2: Add Loki to Docker Compose Stack
|
||||
|
||||
**Objective**: Add Grafana Loki as a log storage backend in the development observability stack.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/docker-compose.yml`:
|
||||
- Add Loki service:
|
||||
```yaml
|
||||
loki:
|
||||
image: grafana/loki:2.9.0
|
||||
ports:
|
||||
- "3100:3100"
|
||||
command: -config.file=/etc/loki/local-config.yaml
|
||||
```
|
||||
- Add Loki as a Grafana data source in provisioning
|
||||
|
||||
- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`:
|
||||
- Configure Loki data source with derived fields linking `trace_id` to Tempo
|
||||
|
||||
**Key new files**:
|
||||
|
||||
- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/docker-compose.yml`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.3: Add Filelog Receiver to OTel Collector
|
||||
|
||||
**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/otel-collector-config.yaml`:
|
||||
- Add `filelog` receiver:
|
||||
```yaml
|
||||
receivers:
|
||||
filelog:
|
||||
include: [/var/log/rippled/debug.log]
|
||||
operators:
|
||||
- type: regex_parser
|
||||
regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
|
||||
timestamp:
|
||||
parse_from: attributes.timestamp
|
||||
layout: "%Y-%m-%dT%H:%M:%S.%fZ"
|
||||
```
|
||||
- Add logs pipeline:
|
||||
```yaml
|
||||
service:
|
||||
pipelines:
|
||||
logs:
|
||||
receivers: [filelog]
|
||||
processors: [batch]
|
||||
exporters: [otlp/loki]
|
||||
```
|
||||
- Add Loki exporter:
|
||||
```yaml
|
||||
exporters:
|
||||
otlp/loki:
|
||||
endpoint: loki:3100
|
||||
tls:
|
||||
insecure: true
|
||||
```
|
||||
|
||||
- Mount rippled's log directory into the collector container via docker-compose volume
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/otel-collector-config.yaml`
|
||||
- `docker/telemetry/docker-compose.yml`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.4: Configure Grafana Trace-to-Log Correlation
|
||||
|
||||
**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit Grafana Tempo data source provisioning to add `tracesToLogs` configuration:
|
||||
|
||||
```yaml
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
filterByTraceID: true
|
||||
filterBySpanID: false
|
||||
tags: ["partition", "severity"]
|
||||
```
|
||||
|
||||
- Edit Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo:
|
||||
```yaml
|
||||
derivedFields:
|
||||
- datasourceUid: tempo
|
||||
matcherRegex: "trace_id=(\\w+)"
|
||||
name: TraceID
|
||||
url: "$${__value.raw}"
|
||||
```
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
|
||||
- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file)
|
||||
|
||||
---
|
||||
|
||||
## Task 8.5: Update Integration Tests
|
||||
|
||||
**Objective**: Verify trace_id appears in logs and Loki correlation works.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docker/telemetry/integration-test.sh`:
|
||||
- After sending RPC requests (which create spans), grep rippled's log output for `trace_id=`
|
||||
- Verify trace_id matches a trace visible in Jaeger
|
||||
- Optionally: query Loki via API to confirm log ingestion
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docker/telemetry/integration-test.sh`
|
||||
|
||||
---
|
||||
|
||||
## Task 8.6: Update Documentation
|
||||
|
||||
**Objective**: Document the log correlation feature in runbook and reference docs.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `docs/telemetry-runbook.md`:
|
||||
- Add "Log-Trace Correlation" section explaining how to use Grafana Tempo -> Loki linking
|
||||
- Add LogQL query examples for filtering by trace_id
|
||||
|
||||
- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
|
||||
- Add new section "3. Log Correlation" between SpanMetrics and StatsD sections
|
||||
- Document the log format with trace_id injection
|
||||
- Document Loki as a new backend
|
||||
|
||||
- Edit `docker/telemetry/TESTING.md`:
|
||||
- Add log correlation verification steps
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `docs/telemetry-runbook.md`
|
||||
- `OpenTelemetryPlan/09-data-collection-reference.md`
|
||||
- `docker/telemetry/TESTING.md`
|
||||
|
||||
---
|
||||
|
||||
## Summary Table
|
||||
|
||||
| Task | Description | Sub-Phase | New Files | Modified Files | Depends On |
|
||||
| ---- | ------------------------------------------ | --------- | --------- | -------------- | ---------- |
|
||||
| 8.1 | Inject trace_id into Logs::format() | 8a | 0 | 1 | Phase 7 |
|
||||
| 8.2 | Add Loki to Docker Compose stack | 8b | 1 | 1 | -- |
|
||||
| 8.3 | Add filelog receiver to OTel Collector | 8b | 0 | 2 | 8.1, 8.2 |
|
||||
| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0 | 2 | 8.3 |
|
||||
| 8.5 | Update integration tests | 8a + 8b | 0 | 1 | 8.4 |
|
||||
| 8.6 | Update documentation | 8a + 8b | 0 | 3 | 8.5 |
|
||||
|
||||
**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential.
|
||||
|
||||
**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)):
|
||||
|
||||
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
|
||||
- [ ] Log lines outside spans have no trace context (no empty fields)
|
||||
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
|
||||
- [ ] Grafana Tempo -> Loki one-click correlation works
|
||||
- [ ] Grafana Loki -> Tempo reverse lookup works via derived field
|
||||
- [ ] Integration test verifies trace_id presence in logs
|
||||
- [ ] No performance regression from trace_id injection (< 0.1% overhead)
|
||||
Reference in New Issue
Block a user