# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List

> **Goal**: Inject trace context (trace_id, span_id) into xrpld's Journal log output for log-trace correlation, and add an OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
>
> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API.
>
> **Branch**: `pratik/otel-phase8-log-correlation` (from `pratik/otel-phase7-native-metrics`)

### Related Plan Documents

| Document                                                         | Relevance                                                      |
| ---------------------------------------------------------------- | -------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md)     | Phase 8 plan: motivation, architecture, exit criteria (§6.8.1) |
| [07-observability-backends.md](./07-observability-backends.md)   | Loki backend recommendation, Grafana data source provisioning  |
| [Phase7_taskList.md](./Phase7_taskList.md)                       | Prerequisite — native OTel metrics pipeline must be working    |
| [05-configuration-reference.md](./05-configuration-reference.md) | `[telemetry]` config (trace_id injection toggle)               |

---

## Task 8.1: Inject trace_id into Logs::format()

**Objective**: Add OTel trace context to every log line that is emitted within an active span.

**What to do**:

- Edit `src/libxrpl/basics/Log.cpp`:
  - In `Logs::format()` (around line 346), after the severity is appended, check for an active OTel span. The implementation checks the context value directly to avoid the heap allocation that `GetSpan()` performs on the no-span path:
    ```cpp
    #ifdef XRPL_ENABLE_TELEMETRY
    {
        // Read the span from the current context directly; GetSpan() would
        // heap-allocate on the no-span path.
        auto context = opentelemetry::context::RuntimeContext::GetCurrent();
        auto spanValue = context.GetValue(opentelemetry::trace::kSpanKey);
        if (opentelemetry::nostd::holds_alternative<
                opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue))
        {
            auto span = opentelemetry::nostd::get<
                opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue);
            auto spanCtx = span->GetContext();
            if (spanCtx.IsValid())
            {
                // Hex buffers are not NUL-terminated, so append with explicit lengths.
                char traceId[32], spanId[16];
                spanCtx.trace_id().ToLowerBase16(
                    opentelemetry::nostd::span<char, 32>{traceId});
                spanCtx.span_id().ToLowerBase16(
                    opentelemetry::nostd::span<char, 16>{spanId});
                output += "trace_id=";
                output.append(traceId, 32);
                output += " span_id=";
                output.append(spanId, 16);
                output += ' ';
            }
        }
    }
    #endif
    ```
  - Add `#include` for OTel context headers, guarded by `#ifdef XRPL_ENABLE_TELEMETRY`

- Edit `include/xrpl/basics/Log.h`:
  - No changes needed — format() signature unchanged

**Key modified files**:

- `src/libxrpl/basics/Log.cpp`

**Performance note**: The implementation checks the thread-local context value directly (avoiding the heap allocation that `GetSpan()` performs on the no-span path). On threads without an active span (~99% of log lines), the cost is a thread-local read plus a variant type check (~15-20ns). On the active-span path, the additional shared_ptr copy, `GetContext()`, and `IsValid()` calls bring the total to ~50ns. This overhead is negligible at typical logging rates.
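
To make the resulting log prefix concrete, the following is a minimal standalone sketch of the same append logic using only the standard library. The names `toLowerBase16` and `appendTraceContext` are hypothetical stand-ins for illustration; they are not part of xrpld or the OpenTelemetry API, and the raw byte arrays stand in for the trace and span IDs that `spanCtx` would supply.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the OTel ToLowerBase16 calls: writes two
// lowercase hex digits per input byte into `out` (no NUL terminator).
template <std::size_t N>
void toLowerBase16(std::array<std::uint8_t, N> const& bytes, char* out)
{
    static constexpr char digits[] = "0123456789abcdef";
    for (std::size_t i = 0; i < N; ++i)
    {
        out[2 * i] = digits[bytes[i] >> 4];
        out[2 * i + 1] = digits[bytes[i] & 0x0f];
    }
}

// Hypothetical helper mirroring the injection block: appends
// "trace_id=<32 hex> span_id=<16 hex> " to an existing log prefix.
void appendTraceContext(
    std::string& output,
    std::array<std::uint8_t, 16> const& traceId,
    std::array<std::uint8_t, 8> const& spanId)
{
    char traceHex[32];
    char spanHex[16];
    toLowerBase16(traceId, traceHex);
    toLowerBase16(spanId, spanHex);

    // The buffers are fixed-width and unterminated, so pass explicit lengths.
    output += "trace_id=";
    output.append(traceHex, sizeof(traceHex));
    output += " span_id=";
    output.append(spanHex, sizeof(spanHex));
    output += ' ';
}
```

Because both fields are fixed-width and always emitted together, a downstream parser can treat them as a single optional group, which is what the collector regex in Task 8.3 relies on.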

---

## Task 8.2: Add Loki to Docker Compose Stack

**Objective**: Add Grafana Loki as a log storage backend in the development observability stack.

**What to do**:

- Edit `docker/telemetry/docker-compose.yml`:
  - Add Loki service:
    ```yaml
    loki:
      image: grafana/loki:2.9.0
      ports:
        - "3100:3100"
      command: -config.file=/etc/loki/local-config.yaml
    ```
  - Add Loki as a Grafana data source in provisioning

- Create `docker/telemetry/grafana/provisioning/datasources/loki.yaml`:
  - Configure Loki data source with derived fields linking `trace_id` to Tempo

**Key new files**:

- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`

**Key modified files**:

- `docker/telemetry/docker-compose.yml`

---

## Task 8.3: Add Filelog Receiver to OTel Collector

**Objective**: Configure the OTel Collector to tail xrpld's log file and export to Loki.

**What to do**:

- Edit `docker/telemetry/otel-collector-config.yaml`:
  - Add `filelog` receiver:
    ```yaml
    receivers:
      filelog:
        include: [/var/log/rippled/debug.log]
        operators:
          - type: regex_parser
            regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
            timestamp:
              parse_from: attributes.timestamp
              layout: "%Y-%m-%dT%H:%M:%S.%fZ"
    ```
  - Add logs pipeline:
    ```yaml
    service:
      pipelines:
        logs:
          receivers: [filelog]
          processors: [batch]
          exporters: [otlp/loki]
    ```
  - Add Loki exporter:
    ```yaml
    exporters:
      otlp/loki:
        endpoint: loki:3100
        tls:
          insecure: true
    ```

- Mount xrpld's log directory into the collector container via docker-compose volume

**Key modified files**:

- `docker/telemetry/otel-collector-config.yaml`
- `docker/telemetry/docker-compose.yml`
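
As a rough way to sanity-check the `regex_parser` pattern outside the collector, the sketch below applies an equivalent expression with `std::regex`. It is an approximation under stated assumptions: the default ECMAScript grammar of `std::regex` has no named capture groups, so the groups are positional here, and `ParsedLogLine` and `parseLogLine` are hypothetical names introduced only for this example.

```cpp
#include <optional>
#include <regex>
#include <string>

// Fields the filelog regex_parser is expected to extract from one line.
struct ParsedLogLine
{
    std::string timestamp;
    std::string partition;
    std::string severity;
    std::optional<std::string> traceId;  // absent outside an active span
    std::optional<std::string> spanId;   // absent outside an active span
    std::string message;
};

// Hypothetical helper: returns std::nullopt when the line does not match.
std::optional<ParsedLogLine> parseLogLine(std::string const& line)
{
    // Same structure as the collector regex, with positional groups:
    // 1=timestamp 2=partition 3=severity 4=trace_id 5=span_id 6=message
    static std::regex const pattern(
        R"(^(\S+)\s+(\S+):(\S+)\s+)"
        R"((?:trace_id=([a-f0-9]+)\s+span_id=([a-f0-9]+)\s+)?(.*)$)");

    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return std::nullopt;

    ParsedLogLine parsed;
    parsed.timestamp = m[1].str();
    parsed.partition = m[2].str();
    parsed.severity = m[3].str();
    if (m[4].matched)
        parsed.traceId = m[4].str();
    if (m[5].matched)
        parsed.spanId = m[5].str();
    parsed.message = m[6].str();
    return parsed;
}
```

Feeding it one line with the `trace_id=... span_id=...` prefix and one without shows the optional group at work: the second line still parses, with `traceId` and `spanId` left empty rather than set to empty strings.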

---

## Task 8.4: Configure Grafana Trace-to-Log Correlation

**Objective**: Enable one-click navigation from Tempo traces to Loki logs in Grafana.

**What to do**:

- Edit Grafana Tempo data source provisioning to add `tracesToLogs` configuration:

  ```yaml
  tracesToLogs:
    datasourceUid: loki
    filterByTraceID: true
    filterBySpanID: false
    tags: ["partition", "severity"]
  ```

- Edit Grafana Loki data source provisioning to add `derivedFields` linking trace_id back to Tempo:
  ```yaml
  derivedFields:
    - datasourceUid: tempo
      matcherRegex: "trace_id=(\\w+)"
      name: TraceID
      url: "$${__value.raw}"
  ```

**Key modified files**:

- `docker/telemetry/grafana/provisioning/datasources/loki.yaml`
- `docker/telemetry/grafana/provisioning/datasources/` (Tempo data source file)

---

## Task 8.5: Update Integration Tests

**Objective**: Verify that trace_id appears in logs and that Loki correlation works.

**What to do**:

- Edit `docker/telemetry/integration-test.sh`:
  - After sending RPC requests (which create spans), grep xrpld's log output for `trace_id=`
  - Verify that each trace_id found matches a trace visible in Tempo
  - Optionally: query Loki via API to confirm log ingestion

**Key modified files**:

- `docker/telemetry/integration-test.sh`
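
The log-scanning step can be sketched as follows. The integration test itself is a shell script, so this C++ version only illustrates the check; `collectTraceIds` is a hypothetical helper, not something the test defines. It gathers the distinct 32-hex-digit trace IDs from a log stream so they can then be looked up in Tempo.

```cpp
#include <istream>
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: returns the distinct trace IDs found in a log stream.
// Only well-formed IDs (exactly 32 lowercase hex digits) are collected.
std::set<std::string> collectTraceIds(std::istream& log)
{
    // \b rejects values longer than 32 hex digits.
    static std::regex const pattern(R"(trace_id=([a-f0-9]{32})\b)");

    std::set<std::string> ids;
    std::string line;
    while (std::getline(log, line))
    {
        std::smatch m;
        if (std::regex_search(line, m, pattern))
            ids.insert(m[1].str());
    }
    return ids;
}
```

An empty result after the RPC requests have been sent would indicate that injection is not active; a non-empty result gives the set of IDs to query against the trace backend.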

---

## Task 8.6: Update Documentation

**Objective**: Document the log correlation feature in the runbook and reference docs.

**What to do**:

- Edit `docs/telemetry-runbook.md`:
  - Add "Log-Trace Correlation" section explaining how to use Grafana Tempo -> Loki linking
  - Add LogQL query examples for filtering by trace_id

- Edit `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Add new section "3. Log Correlation" between SpanMetrics and StatsD sections
  - Document the log format with trace_id injection
  - Document Loki as a new backend

- Edit `docker/telemetry/TESTING.md`:
  - Add log correlation verification steps

**Key modified files**:

- `docs/telemetry-runbook.md`
- `OpenTelemetryPlan/09-data-collection-reference.md`
- `docker/telemetry/TESTING.md`

---

## Summary Table

| Task | Description                                | Sub-Phase | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------ | --------- | --------- | -------------- | ---------- |
| 8.1  | Inject trace_id into Logs::format()        | 8a        | 0         | 1              | Phase 7    |
| 8.2  | Add Loki to Docker Compose stack           | 8b        | 1         | 1              | --         |
| 8.3  | Add filelog receiver to OTel Collector     | 8b        | 0         | 2              | 8.1, 8.2   |
| 8.4  | Configure Grafana trace-to-log correlation | 8b        | 0         | 2              | 8.3        |
| 8.5  | Update integration tests                   | 8a + 8b   | 0         | 1              | 8.4        |
| 8.6  | Update documentation                       | 8a + 8b   | 0         | 3              | 8.5        |

**Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential.

**Exit Criteria** (from [06-implementation-phases.md §6.8.1](./06-implementation-phases.md)):

- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
- [ ] Log lines outside spans have no trace context (no empty fields)
- [ ] Loki ingests xrpld logs via OTel Collector filelog receiver
- [ ] Grafana Tempo -> Loki one-click correlation works
- [ ] Grafana Loki -> Tempo reverse lookup works via derived field
- [ ] Integration test verifies trace_id presence in logs
- [ ] No performance regression from trace_id injection (< 0.1% overhead)