rippled/OpenTelemetryPlan/Phase8_taskList.md
Pratik Mankawde a8549a7ab2 fix(telemetry): address code review findings for Phase 8 log-trace correlation
- Replace GetSpan() with direct context value check in Logs::format()
  to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
  and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
  detect missing log files, fix silent grep error masking

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:32:46 +01:00

Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List

Goal: Inject trace context (trace_id, span_id) into xrpld's Journal log output for log-trace correlation, and add an OTel Collector filelog receiver that ingests those logs into Grafana Loki for unified observability.

Scope: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the beast::Journal public API.

Branch: pratik/otel-phase8-log-correlation (from pratik/otel-phase7-native-metrics)

| Document | Relevance |
| --- | --- |
| 06-implementation-phases.md | Phase 8 plan: motivation, architecture, exit criteria (§6.8.1) |
| 07-observability-backends.md | Loki backend recommendation, Grafana data source provisioning |
| Phase7_taskList.md | Prerequisite: native OTel metrics pipeline must be working |
| 05-configuration-reference.md | [telemetry] config (trace_id injection toggle) |

Task 8.1: Inject trace_id into Logs::format()

Objective: Add OTel trace context to every log line that is emitted within an active span.

What to do:

  • Edit src/libxrpl/basics/Log.cpp:

    • In Logs::format() (around line 346), after the severity is appended, check for an active OTel span. The implementation checks the context value directly to avoid the heap allocation that GetSpan() performs on the no-span path:
      #ifdef XRPL_ENABLE_TELEMETRY
      {
          // Read the active-span slot from the thread-local context directly.
          // GetSpan() would allocate a DefaultSpan when no span is set.
          auto context = opentelemetry::context::RuntimeContext::GetCurrent();
          auto spanValue = context.GetValue(opentelemetry::trace::kSpanKey);
          if (opentelemetry::nostd::holds_alternative<
                  opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue))
          {
              auto span = opentelemetry::nostd::get<
                  opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue);
              auto spanCtx = span->GetContext();
              if (spanCtx.IsValid())
              {
                  // Hex-encode the IDs: 16-byte trace ID -> 32 chars,
                  // 8-byte span ID -> 16 chars. The buffers are not
                  // NUL-terminated, so append with explicit lengths.
                  char traceId[32], spanId[16];
                  spanCtx.trace_id().ToLowerBase16(
                      opentelemetry::nostd::span<char, 32>{traceId});
                  spanCtx.span_id().ToLowerBase16(
                      opentelemetry::nostd::span<char, 16>{spanId});
                  output += "trace_id=";
                  output.append(traceId, 32);
                  output += " span_id=";
                  output.append(spanId, 16);
                  output += ' ';
              }
          }
      }
      #endif
      
    • Add #include for OTel context headers, guarded by #ifdef XRPL_ENABLE_TELEMETRY
  • Edit include/xrpl/basics/Log.h:

    • No changes needed — format() signature unchanged

Key modified files:

  • src/libxrpl/basics/Log.cpp

Performance note: The implementation reads the thread-local context value directly, avoiding the heap allocation that GetSpan() performs on the no-span path. On threads without an active span (~99% of log lines), the cost is one thread-local read plus a variant type check (~15-20ns). On the active-span path, the additional shared_ptr copy, GetContext(), and IsValid() calls bring the total to ~50ns. Both figures are negligible at typical logging rates.
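
The resulting line shape can be illustrated without the OTel SDK. The following is a minimal, self-contained sketch, not the Log.cpp implementation: FakeSpanContext, appendLowerHex, and formatLine are hypothetical names, the timestamp prefix is omitted, and an empty std::optional stands in for a thread with no active span.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for an OTel span context (not the real API):
// a 16-byte trace ID and an 8-byte span ID.
struct FakeSpanContext
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
};

// Append the lowercase hex encoding of `bytes` to `out` (2 chars per byte).
template <std::size_t N>
void
appendLowerHex(std::string& out, std::array<std::uint8_t, N> const& bytes)
{
    static constexpr char digits[] = "0123456789abcdef";
    for (auto b : bytes)
    {
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
}

// Build "<partition>:<severity> [trace_id=<hex> span_id=<hex> ]<message>".
// The bracketed part is emitted only when a span context is present, so
// lines outside spans carry no empty trace fields.
std::string
formatLine(
    std::string const& partition,
    std::string const& severity,
    std::optional<FakeSpanContext> const& span,
    std::string const& message)
{
    std::string output = partition + ":" + severity + " ";
    if (span)
    {
        output += "trace_id=";
        appendLowerHex(output, span->traceId);
        output += " span_id=";
        appendLowerHex(output, span->spanId);
        output += ' ';
    }
    output += message;
    return output;
}
```

Calling formatLine("RPCHandler", "NFO", ctx, "ok") with a populated context produces a 32-character trace_id and a 16-character span_id between the severity and the message; passing std::nullopt produces "RPCHandler:NFO ok" unchanged.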


Task 8.2: Add Loki to Docker Compose Stack

Objective: Add Grafana Loki as a log storage backend in the development observability stack.

What to do:

  • Edit docker/telemetry/docker-compose.yml:

    • Add Loki service:
      loki:
        image: grafana/loki:3.0.0
        ports:
          - "3100:3100"
        command: -config.file=/etc/loki/local-config.yaml
      
    • Add Loki as a Grafana data source in provisioning
  • Create docker/telemetry/grafana/provisioning/datasources/loki.yaml:

    • Configure Loki data source with derived fields linking trace_id to Tempo

Key new files:

  • docker/telemetry/grafana/provisioning/datasources/loki.yaml

Key modified files:

  • docker/telemetry/docker-compose.yml

Task 8.3: Add Filelog Receiver to OTel Collector

Objective: Configure the OTel Collector to tail xrpld's log file and export to Loki.

What to do:

  • Edit docker/telemetry/otel-collector-config.yaml:

    • Add filelog receiver:
      receivers:
        filelog:
          include: [/var/log/rippled/debug.log]
          operators:
            - type: regex_parser
              regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
              timestamp:
                parse_from: attributes.timestamp
                layout: "%Y-%m-%dT%H:%M:%S.%fZ"
      
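    • Add logs pipeline:
      service:
        pipelines:
          logs:
            receivers: [filelog]
            processors: [batch]
            exporters: [otlphttp/loki]
      
    • Add Loki exporter. Loki serves its native OTLP endpoint over HTTP under /otlp on port 3100 (Loki 3.0 and later), so use the otlphttp exporter:
      exporters:
        otlphttp/loki:
          endpoint: http://loki:3100/otlp
          tls:
            insecure: true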
      
  • Mount xrpld's log directory into the collector container via docker-compose volume
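
The behavior of the regex_parser pattern, in particular its optional trace-context group, can be checked offline. This is a rough sketch under stated assumptions: std::regex has no named groups, so the pattern is transcribed with numbered groups; ParsedLine and parseLogLine are hypothetical names; and any sample timestamp is assumed to be a single whitespace-free token, as the pattern's leading \S+ requires.

```cpp
#include <optional>
#include <regex>
#include <string>

// Fields captured by the filelog pattern. traceId and spanId stay empty
// when the line was logged outside an active span.
struct ParsedLine
{
    std::string timestamp;
    std::string partition;
    std::string severity;
    std::string traceId;
    std::string spanId;
    std::string message;
};

// Apply a numbered-group transcription of the collector regex to one line.
// Returns std::nullopt when the line does not match at all.
std::optional<ParsedLine>
parseLogLine(std::string const& line)
{
    static std::regex const pattern(
        R"(^(\S+)\s+(\S+):(\S+)\s+)"
        R"((?:trace_id=([a-f0-9]+)\s+span_id=([a-f0-9]+)\s+)?)"
        R"((.*)$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return std::nullopt;
    // Unmatched optional groups yield empty strings from str().
    return ParsedLine{
        m[1].str(), m[2].str(), m[3].str(),
        m[4].str(), m[5].str(), m[6].str()};
}
```

A line with "trace_id=... span_id=..." after the severity fills all six fields, while a line without them still matches and leaves traceId and spanId empty, which is what lets the collector ingest both kinds of line through one operator.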

Key modified files:

  • docker/telemetry/otel-collector-config.yaml
  • docker/telemetry/docker-compose.yml

Task 8.4: Configure Grafana Trace-to-Log Correlation

Objective: Enable one-click navigation from Tempo traces to Loki logs in Grafana.

What to do:

  • Edit Grafana Tempo data source provisioning to add tracesToLogs configuration:

    tracesToLogs:
      datasourceUid: loki
      filterByTraceID: true
      filterBySpanID: false
      tags: ["partition", "severity"]
    
  • Edit Grafana Loki data source provisioning to add derivedFields linking trace_id back to Tempo:

    derivedFields:
      - datasourceUid: tempo
        matcherRegex: "trace_id=(\\w+)"
        name: TraceID
        url: "$${__value.raw}"
    

Key modified files:

  • docker/telemetry/grafana/provisioning/datasources/loki.yaml
  • docker/telemetry/grafana/provisioning/datasources/ (Tempo data source file)

Task 8.5: Update Integration Tests

Objective: Verify trace_id appears in logs and Loki correlation works.

What to do:

  • Edit docker/telemetry/integration-test.sh:
    • After sending RPC requests (which create spans), grep xrpld's log output for trace_id=
    • Verify trace_id matches a trace visible in Tempo
    • Optionally: query Loki via API to confirm log ingestion
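
The log-scanning step can be sketched as follows. The actual check lives in the shell script; this C++ version only illustrates the logic, and CorrelationScan and scanForTraceIds are hypothetical names. It assumes the "trace_id=<32 hex> span_id=<16 hex> " shape produced by Task 8.1 and collects the distinct trace IDs, which a test could then look up in Tempo.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <set>
#include <string>

// Result of scanning one log stream for injected trace context.
struct CorrelationScan
{
    std::size_t linesScanned = 0;    // total lines read
    std::size_t linesWithTrace = 0;  // lines carrying trace_id/span_id
    std::set<std::string> traceIds;  // distinct trace IDs seen
};

// Read `log` line by line and record every well-formed trace context.
// A zero linesScanned count lets the caller tell "no log data" apart
// from "log data without any trace IDs".
CorrelationScan
scanForTraceIds(std::istream& log)
{
    static std::regex const pattern(
        R"(trace_id=([a-f0-9]{32}) span_id=([a-f0-9]{16}) )");
    CorrelationScan result;
    std::string line;
    while (std::getline(log, line))
    {
        ++result.linesScanned;
        std::smatch m;
        if (std::regex_search(line, m, pattern))
        {
            ++result.linesWithTrace;
            result.traceIds.insert(m[1].str());
        }
    }
    return result;
}
```

Tracking linesScanned separately from linesWithTrace keeps a missing or empty log file from being reported as a plain "no trace IDs found" result.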

Key modified files:

  • docker/telemetry/integration-test.sh

Task 8.6: Update Documentation

Objective: Document the log correlation feature in runbook and reference docs.

What to do:

  • Edit docs/telemetry-runbook.md:

    • Add "Log-Trace Correlation" section explaining how to use Grafana Tempo -> Loki linking
    • Add LogQL query examples for filtering by trace_id
  • Edit OpenTelemetryPlan/09-data-collection-reference.md:

    • Add new section "3. Log Correlation" between SpanMetrics and StatsD sections
    • Document the log format with trace_id injection
    • Document Loki as a new backend
  • Edit docker/telemetry/TESTING.md:

    • Add log correlation verification steps

Key modified files:

  • docs/telemetry-runbook.md
  • OpenTelemetryPlan/09-data-collection-reference.md
  • docker/telemetry/TESTING.md

Summary Table

| Task | Description | Sub-Phase | New Files | Modified Files | Depends On |
| --- | --- | --- | --- | --- | --- |
| 8.1 | Inject trace_id into Logs::format() | 8a | 0 | 1 | Phase 7 |
| 8.2 | Add Loki to Docker Compose stack | 8b | 1 | 1 | -- |
| 8.3 | Add filelog receiver to OTel Collector | 8b | 0 | 2 | 8.1, 8.2 |
| 8.4 | Configure Grafana trace-to-log correlation | 8b | 0 | 2 | 8.3 |
| 8.5 | Update integration tests | 8a + 8b | 0 | 1 | 8.4 |
| 8.6 | Update documentation | 8a + 8b | 0 | 3 | 8.5 |

Parallel work: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3-8.6 are sequential.

Exit Criteria (from 06-implementation-phases.md §6.8.1):

  • Log lines within active spans contain trace_id=<hex> span_id=<hex>
  • Log lines outside spans have no trace context (no empty fields)
  • Loki ingests xrpld logs via OTel Collector filelog receiver
  • Grafana Tempo -> Loki one-click correlation works
  • Grafana Loki -> Tempo reverse lookup works via derived field
  • Integration test verifies trace_id presence in logs
  • No performance regression from trace_id injection (< 0.1% overhead)