diff --git a/.clang-tidy b/.clang-tidy index 6a7005b464..6a967532db 100644 --- a/.clang-tidy +++ b/.clang-tidy @@ -7,7 +7,6 @@ Checks: "-*, bugprone-bad-signal-to-kill-thread, bugprone-bool-pointer-implicit-conversion, bugprone-casting-through-void, - bugprone-capturing-this-in-member-variable, bugprone-chained-comparison, bugprone-compare-pointer-to-member-virtual-function, bugprone-copy-constructor-init, @@ -29,7 +28,6 @@ Checks: "-*, bugprone-misplaced-operator-in-strlen-in-alloc, bugprone-misplaced-pointer-arithmetic-in-alloc, bugprone-misplaced-widening-cast, - bugprone-misleading-setter-of-reference, bugprone-move-forwarding-reference, bugprone-multi-level-implicit-pointer-conversion, bugprone-multiple-new-in-one-expression, @@ -87,7 +85,6 @@ Checks: "-*, cppcoreguidelines-pro-type-static-cast-downcast, cppcoreguidelines-rvalue-reference-param-not-moved, cppcoreguidelines-use-default-member-init, - cppcoreguidelines-use-enum-class, cppcoreguidelines-virtual-class-destructor, hicpp-ignored-remove-result, misc-const-correctness, @@ -112,7 +109,6 @@ Checks: "-*, modernize-use-nodiscard, modernize-use-override, modernize-use-ranges, - modernize-use-scoped-lock, modernize-use-starts-ends-with, modernize-use-std-numbers, modernize-use-using, @@ -126,7 +122,6 @@ Checks: "-*, performance-move-constructor-init, performance-no-automatic-move, performance-trivially-destructible, - readability-ambiguous-smartptr-reset-call, readability-avoid-nested-conditional-operator, readability-avoid-return-with-void-value, readability-braces-around-statements, diff --git a/.github/scripts/levelization/generate.py b/.github/scripts/levelization/generate.py old mode 100644 new mode 100755 diff --git a/.github/scripts/levelization/results/ordering.txt b/.github/scripts/levelization/results/ordering.txt index 6c94d5a2cd..95f3078128 100644 --- a/.github/scripts/levelization/results/ordering.txt +++ b/.github/scripts/levelization/results/ordering.txt @@ -194,10 +194,15 @@ tests.libxrpl > 
xrpl.basics tests.libxrpl > xrpl.core tests.libxrpl > xrpld.telemetry tests.libxrpl > xrpl.json +tests.libxrpl > xrpl.ledger tests.libxrpl > xrpl.net +tests.libxrpl > xrpl.nodestore tests.libxrpl > xrpl.protocol tests.libxrpl > xrpl.protocol_autogen +tests.libxrpl > xrpl.server +tests.libxrpl > xrpl.shamap tests.libxrpl > xrpl.telemetry +tests.libxrpl > xrpl.tx xrpl.conditions > xrpl.basics xrpl.conditions > xrpl.protocol xrpl.core > xrpl.basics diff --git a/.github/scripts/rename/README.md b/.github/scripts/rename/README.md index ab685bb0c3..123881094e 100644 --- a/.github/scripts/rename/README.md +++ b/.github/scripts/rename/README.md @@ -1,11 +1,11 @@ ## Renaming ripple(d) to xrpl(d) In the initial phases of development of the XRPL, the open source codebase was -called "rippled" and it remains with that name even today. Today, over 1000 +called "xrpld" and it remains with that name even today. Today, over 1000 nodes run the application, and code contributions have been submitted by developers located around the world. The XRPL community is larger than ever. In light of the decentralized and diversified nature of XRPL, we will rename any -references to `ripple` and `rippled` to `xrpl` and `xrpld`, when appropriate. +references to `ripple` and `xrpld` to `xrpl` and `xrpld`, when appropriate. See [here](https://xls.xrpl.org/xls/XLS-0095-rename-rippled-to-xrpld.html) for more information. @@ -22,17 +22,17 @@ run from the repository root. 2. `.github/scripts/rename/copyright.sh`: This script will remove superfluous copyright notices. 3. `.github/scripts/rename/cmake.sh`: This script will rename all CMake files - from `RippleXXX.cmake` or `RippledXXX.cmake` to `XrplXXX.cmake`, and any - references to `ripple` and `rippled` (with or without capital letters) to + from `RippleXXX.cmake` or `XrpldXXX.cmake` to `XrplXXX.cmake`, and any + references to `ripple` and `xrpld` (with or without capital letters) to `xrpl` and `xrpld`, respectively. 
The name of the binary will remain as-is, and will only be renamed to `xrpld` by a later script. 4. `.github/scripts/rename/binary.sh`: This script will rename the binary from - `rippled` to `xrpld`, and reverses the symlink so that `rippled` points to + `xrpld` to `xrpld`, and reverses the symlink so that `xrpld` points to the `xrpld` binary. 5. `.github/scripts/rename/namespace.sh`: This script will rename the C++ namespaces from `ripple` to `xrpl`. 6. `.github/scripts/rename/config.sh`: This script will rename the config from - `rippled.cfg` to `xrpld.cfg`, and updating the code accordingly. The old + `xrpld.cfg` to `xrpld.cfg`, and updating the code accordingly. The old filename will still be accepted. 7. `.github/scripts/rename/docs.sh`: This script will rename any lingering references of `ripple(d)` to `xrpl(d)` in code, comments, and documentation. diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index e003d7bd69..96ff90f8b5 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -396,7 +396,187 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi --- -## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) +## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12) + +**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline. + +### Motivation: Why Migrate from StatsD to Native OTel Metrics + +The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves. + +#### What We Gain + +1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. 
Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." + +2. **Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. + +3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses a non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 — DEFERRED becomes resolved). + +4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. + +5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. + +6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. + +7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. + +#### What We Lose + +1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. + +2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. 
The OTel SDK handles this, but it's more moving parts. + +3. **Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. + +4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. + +#### Decision + +The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. + +### Architecture + +#### Class Hierarchy (after Phase 7) + +``` +beast::insight::Collector (abstract interface — unchanged) + | + +-- StatsDCollector (existing — retained as fallback, deprecated) + | +-- StatsDCounterImpl -> StatsD |c over UDP + | +-- StatsDGaugeImpl -> StatsD |g over UDP + | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) + | +-- StatsDEventImpl -> StatsD |ms over UDP + | +-- StatsDHookImpl -> 1s periodic callback + | + +-- NullCollector (existing — unchanged, used when disabled) + | +-- NullCounterImpl -> no-op + | +-- NullGaugeImpl -> no-op + | +-- NullMeterImpl -> no-op + | +-- NullEventImpl -> no-op + | +-- NullHookImpl -> no-op + | + +-- OTelCollector (NEW — Phase 7) + +-- OTelCounterImpl -> otel::Counter + +-- OTelGaugeImpl -> otel::ObservableGauge + +-- OTelMeterImpl -> otel::Counter + +-- OTelEventImpl -> otel::Histogram + +-- OTelHookImpl -> 1s periodic callback (same pattern) +``` + +#### Data Flow (after Phase 7) + +```mermaid +graph LR + subgraph xrpldNode["xrpld Node"] + A["Trace Macros<br/>XRPL_TRACE_SPAN"] + B["beast::insight<br/>OTelCollector"] + end + + subgraph collector["OTel Collector :4317 / :4318"] + direction TB + R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"] + BP["Batch Processor"] + SM["SpanMetrics Connector"] + + R1 --> BP + BP --> SM + end + + subgraph backends["Trace Backends"] + D["Jaeger / Tempo"] + end + + subgraph metrics["Metrics Stack"] + E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"] + end + + subgraph viz["Visualization"] + F["Grafana :3000"] + end + + A -->|"OTLP/HTTP :4318<br/>(traces)"| R1 + B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1 + + BP -->|"OTLP/gRPC"| D + SM -->|"RED metrics"| E + R1 -->|"xrpld_* metrics<br/>(native OTLP)"| E + + E --> F + D --> F + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style B fill:#d9534f,color:#fff,stroke:#b52d2d + style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style SM fill:#449d44,color:#fff,stroke:#2d6e2d + style D fill:#f0ad4e,color:#000,stroke:#c78c2e + style E fill:#f0ad4e,color:#000,stroke:#c78c2e + style F fill:#5bc0de,color:#000,stroke:#3aa8c1 + style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c + style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e + style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e + style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de +``` + +**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. + +#### Configuration + +```ini +# [insight] section — new "otel" server option +[insight] +server=otel # NEW: uses OTel OTLP metrics exporter +prefix=xrpld # metric name prefix (preserved) + +# Endpoint and auth inherited from [telemetry] section: +[telemetry] +enabled=1 +endpoint=http://localhost:4318/v1/traces +``` + +The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. + +**Backward compatibility**: `server=statsd` continues to work exactly as before. + +See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. 
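
Reviewer note on the Configuration hunk: the plan says the metrics exporter reuses the `[telemetry]` endpoint with `/v1/traces` swapped for `/v1/metrics`. A minimal sketch of that derivation follows. `deriveMetricsEndpoint` is a hypothetical helper name, not taken from the xrpld source, and the sketch assumes the swap only applies when the configured URL ends with `/v1/traces`; any other endpoint is returned unchanged.

```cpp
#include <string>

// Hypothetical helper (illustration only, not part of xrpld):
// derive the OTLP metrics endpoint from the configured traces endpoint
// by swapping a trailing "/v1/traces" for "/v1/metrics".
// Assumption: endpoints without that suffix are passed through untouched.
std::string
deriveMetricsEndpoint(std::string const& tracesEndpoint)
{
    static std::string const tracesPath = "/v1/traces";
    static std::string const metricsPath = "/v1/metrics";

    std::string result = tracesEndpoint;
    if (result.size() >= tracesPath.size())
    {
        auto const pos = result.size() - tracesPath.size();
        // Only rewrite when the suffix matches exactly.
        if (result.compare(pos, tracesPath.size(), tracesPath) == 0)
            result.replace(pos, tracesPath.size(), metricsPath);
    }
    return result;
}
```

Matching only the suffix avoids rewriting a `/v1/traces` fragment that happens to appear earlier in a proxied URL; whether the real `OTelCollector` is this strict is not stated in the plan.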
+ +### Instrument Type Mapping + +| beast::insight | OTel Metrics SDK | Rationale | +| ---------------------- | -------------------------------- | ---------------------------------------------------------------- | +| Counter (int64, `\|c`) | `Counter` | Direct 1:1 mapping | +| Gauge (uint64, `\|g`) | `ObservableGauge` | Async callback matches existing Hook polling pattern | +| Meter (uint64, `\|m`) | `Counter` | Fixes non-standard wire format; meters are semantically counters | +| Event (ms, `\|ms`) | `Histogram` | Duration distributions with explicit bucket boundaries | +| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval | + +### Tasks + +| Task | Description | +| ---- | ------------------------------------------------------------------------- | +| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) | +| 7.2 | Implement `OTelCollector` class (~400-500 lines) | +| 7.3 | Update `CollectorManager` — add `server=otel` | +| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) | +| 7.5 | Preserve metric names in Prometheus (naming strategy) | +| 7.6 | Update Grafana dashboards (if names change) | +| 7.7 | Update integration tests | +| 7.8 | Update documentation (runbook, reference docs) | + +### Exit Criteria + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant + +--- + +## 6.9 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) ### Motivation @@ -408,7 +588,7 @@ xrpld's `beast::Journal` logs and OpenTelemetry traces are currently two disjoin 2. 
**Reverse lookup (log-to-trace)** — Loki derived fields make `trace_id` values clickable links back to Tempo. 3. **Unified observability** — All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance. 4. **Zero new dependencies in xrpld** — Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1. -5. **Negligible overhead** — `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min. +5. **Negligible overhead** — The implementation checks the thread-local context value directly, avoiding heap allocation on the no-span path (~15-20ns). On the active-span path, total cost is ~50ns per log call. At typical logging rates, overhead is negligible. #### Losses / Risks diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index 1fe4bb83e8..10cab2e47d 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -517,7 +517,7 @@ Example: ### Implementation -The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It reads the current span from OTel's thread-local runtime context via `opentelemetry::trace::GetSpan()` and `opentelemetry::context::RuntimeContext::GetCurrent()`. Both calls are lock-free thread-local reads measured at <10ns per call. +The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It checks the thread-local runtime context value directly (via `RuntimeContext::GetCurrent().GetValue(kSpanKey)`) to avoid the heap allocation that `GetSpan()` performs on the no-span path. On threads without an active span, the cost is a thread-local read + variant type check (~15-20ns). On the active-span path, total cost is ~50ns per log call. 
### Log Ingestion Pipeline diff --git a/OpenTelemetryPlan/Phase8_taskList.md b/OpenTelemetryPlan/Phase8_taskList.md index d7c4770584..3f68f2c7ac 100644 --- a/OpenTelemetryPlan/Phase8_taskList.md +++ b/OpenTelemetryPlan/Phase8_taskList.md @@ -24,23 +24,32 @@ **What to do**: - Edit `src/libxrpl/basics/Log.cpp`: - - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span: + - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span. The implementation checks the context value directly to avoid the heap allocation that `GetSpan()` performs on the no-span path: ```cpp #ifdef XRPL_ENABLE_TELEMETRY - auto span = opentelemetry::trace::GetSpan( - opentelemetry::context::RuntimeContext::GetCurrent()); - auto ctx = span->GetContext(); - if (ctx.IsValid()) { - // Append trace context as structured fields - char traceId[33], spanId[17]; - ctx.trace_id().ToLowerBase16(traceId); - ctx.span_id().ToLowerBase16(spanId); - output += "trace_id="; - output.append(traceId, 32); - output += " span_id="; - output.append(spanId, 16); - output += ' '; + auto context = opentelemetry::context::RuntimeContext::GetCurrent(); + auto spanValue = context.GetValue(opentelemetry::trace::kSpanKey); + if (opentelemetry::nostd::holds_alternative< + opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue)) + { + auto span = opentelemetry::nostd::get< + opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue); + auto spanCtx = span->GetContext(); + if (spanCtx.IsValid()) + { + char traceId[32], spanId[16]; + spanCtx.trace_id().ToLowerBase16( + opentelemetry::nostd::span<char, 32>{traceId}); + spanCtx.span_id().ToLowerBase16( + opentelemetry::nostd::span<char, 16>{spanId}); + output += "trace_id="; + output.append(traceId, 32); + output += " span_id="; + output.append(spanId, 16); + output += ' '; + } + } } #endif ``` @@ -53,7 +62,7 @@ - `src/libxrpl/basics/Log.cpp` -**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking — measured at 
<10ns per call. With ~1000 JLOG calls/min, this adds <10us/min of overhead. +**Performance note**: The implementation checks the thread-local context value directly (avoiding the heap allocation that `GetSpan()` performs on the no-span path). On threads without an active span (~99% of log lines), the cost is a thread-local read + variant type check (~15-20ns). On the active-span path, an additional shared_ptr copy + `GetContext()` + `IsValid()` adds ~50ns total. Overhead is negligible at typical logging rates. --- diff --git a/docker/telemetry/TESTING.md b/docker/telemetry/TESTING.md index 418447e59f..9514954e8e 100644 --- a/docker/telemetry/TESTING.md +++ b/docker/telemetry/TESTING.md @@ -492,17 +492,17 @@ severity code and the message. Example: Lines emitted outside of an active span (background tasks, startup) will NOT have trace context — this is expected. -### Step 2: Cross-check trace_id in Jaeger +### Step 2: Cross-check trace_id in Tempo -Extract a `trace_id` from the log and verify it exists in Jaeger: +Extract a `trace_id` from the log and verify it exists in Tempo: ```bash TRACE_ID=$(grep -o 'trace_id=[a-f0-9]\{32\}' /path/to/debug.log | head -1 | cut -d= -f2) echo "Checking trace: $TRACE_ID" -curl -s "http://localhost:16686/api/traces/$TRACE_ID" | jq '.data | length' +curl -s "http://localhost:3200/api/traces/$TRACE_ID" | jq '.data | length' ``` -Expected result: `1` (the trace exists in Jaeger). +Expected result: `1` (the trace exists in Tempo). ### Step 3: Verify Loki log ingestion @@ -540,7 +540,7 @@ Expected: > 0 results. 
| `trace_id=` in debug.log | Present in log lines within active spans | | `span_id=` in debug.log | Present alongside trace_id | | Logs without active span | No trace_id/span_id fields | -| trace_id in Jaeger | Matches a valid trace | +| trace_id in Tempo | Matches a valid trace | | Loki log ingestion | Logs visible via LogQL | | Tempo -> Loki "Logs for trace" | Shows correlated log lines | | Loki -> Tempo TraceID link | Navigates to correct trace | diff --git a/docker/telemetry/docker-compose.yml b/docker/telemetry/docker-compose.yml index 5ea1ab3111..4fa3292888 100644 --- a/docker/telemetry/docker-compose.yml +++ b/docker/telemetry/docker-compose.yml @@ -2,12 +2,15 @@ # # Provides services for local development: # - otel-collector: receives OTLP traces from xrpld, batches and -# forwards them to Tempo. Listens on ports 4317 (gRPC) -# and 4318 (HTTP). +# forwards them to Tempo. Also tails xrpld log files +# via filelog receiver and exports to Loki. Listens on ports +# 4317 (gRPC) and 4318 (HTTP). # - tempo: Grafana Tempo tracing backend, queryable via Grafana Explore # on port 3000. Recommended for production (S3/GCS storage, TraceQL). -# - grafana: dashboards on port 3000, pre-configured with Tempo -# and Prometheus datasources. +# - loki: Grafana Loki log aggregation backend for centralized log +# ingestion and log-trace correlation (Phase 8). +# - grafana: dashboards on port 3000, pre-configured with Tempo, +# Prometheus, and Loki datasources. # # Usage: # docker compose -f docker/telemetry/docker-compose.yml up -d @@ -27,15 +30,19 @@ services: - "4317:4317" # OTLP gRPC - "4318:4318" # OTLP HTTP (traces + native OTel metrics) - "8889:8889" # Prometheus metrics (spanmetrics + OTLP) - - "13133:13133" # Health check # StatsD UDP port removed — beast::insight now uses native OTLP. 
# Uncomment if using server=statsd fallback: # - "8125:8125/udp" volumes: # Mount collector pipeline config (receivers → processors → exporters) - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro + # Phase 8: Mount rippled log directory for filelog receiver. + # The integration test writes logs to /tmp/xrpld-integration/; + # mount it read-only so the collector can tail debug.log files. + - /tmp/xrpld-integration:/var/log/rippled:ro depends_on: - tempo + - loki networks: - xrpld-telemetry @@ -54,12 +61,27 @@ services: networks: - xrpld-telemetry + # Phase 8: Grafana Loki for centralized log ingestion and log-trace + # correlation. Loki 3.x supports native OTLP ingestion, so the OTel + # Collector exports via otlphttp to Loki's /otlp endpoint. + # Query logs via Grafana Explore -> Loki at http://localhost:3000. + loki: + image: grafana/loki:3.4.2 + ports: + - "3100:3100" + command: -config.file=/etc/loki/local-config.yaml + volumes: + - loki-data:/loki + networks: + - xrpld-telemetry + prometheus: image: prom/prometheus:latest ports: - "9090:9090" volumes: - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro + - prometheus-data:/prometheus depends_on: - otel-collector networks: @@ -81,6 +103,7 @@ services: depends_on: - tempo - prometheus + - loki networks: - xrpld-telemetry @@ -89,6 +112,8 @@ services: # docker compose -f docker/telemetry/docker-compose.yml down -v volumes: tempo-data: + prometheus-data: + loki-data: # Isolated bridge network so services communicate by container name # (e.g., the collector reaches Tempo at http://tempo:4317). diff --git a/docker/telemetry/integration-test.sh b/docker/telemetry/integration-test.sh index 7cedd7ac35..55ad13e08c 100755 --- a/docker/telemetry/integration-test.sh +++ b/docker/telemetry/integration-test.sh @@ -64,14 +64,15 @@ check_span() { fi } -# Phase 8: Verify trace_id injection in rippled log output. +# Phase 8: Verify trace_id injection in xrpld log output. 
# Greps all node debug.log files for the "trace_id= span_id=" # pattern that Logs::format() injects when an active OTel span exists. -# Also cross-checks that a trace_id found in logs matches a trace in Jaeger. +# Also cross-checks that a trace_id found in logs matches a trace in Tempo. check_log_correlation() { log "Checking log-trace correlation..." local total_matches=0 + local files_scanned=0 local sample_trace_id="" for i in $(seq 1 "$NUM_NODES"); do @@ -79,30 +80,35 @@ check_log_correlation() { if [ ! -f "$logfile" ]; then continue fi + files_scanned=$((files_scanned + 1)) local matches - matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile" 2>/dev/null || echo 0) + matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile") || matches=0 total_matches=$((total_matches + matches)) - # Capture the first trace_id we find for cross-referencing with Jaeger if [ -z "$sample_trace_id" ] && [ "$matches" -gt 0 ]; then sample_trace_id=$(grep -o 'trace_id=[a-f0-9]\{32\}' "$logfile" | head -1 | cut -d= -f2) fi done - if [ "$total_matches" -gt 0 ]; then - ok "Log correlation: found $total_matches log lines with trace_id" - else - fail "Log correlation: no trace_id found in any node debug.log" + if [ "$files_scanned" -eq 0 ]; then + fail "Log correlation: no debug.log files found in $WORKDIR/node*/" + return fi - # Cross-check: verify the sample trace_id exists in Jaeger + if [ "$total_matches" -gt 0 ]; then + ok "Log correlation: found $total_matches log lines with trace_id ($files_scanned nodes scanned)" + else + fail "Log correlation: no trace_id found in any node debug.log ($files_scanned nodes scanned)" + fi + + # Cross-check: verify the sample trace_id exists in Tempo if [ -n "$sample_trace_id" ]; then local trace_found - trace_found=$(curl -sf "$JAEGER/api/traces/$sample_trace_id" \ - | jq '.data | length' 2>/dev/null || echo 0) + trace_found=$(curl -sf "$TEMPO/api/traces/$sample_trace_id" \ + | jq '.data | length' 
2>/dev/null) || trace_found=0 if [ "$trace_found" -gt 0 ]; then - ok "Log-Jaeger cross-check: trace_id=$sample_trace_id found in Jaeger" + ok "Log-Tempo cross-check: trace_id=$sample_trace_id found in Tempo" else - fail "Log-Jaeger cross-check: trace_id=$sample_trace_id NOT found in Jaeger" + fail "Log-Tempo cross-check: trace_id=$sample_trace_id NOT found in Tempo" fi fi } diff --git a/docker/telemetry/otel-collector-config.yaml b/docker/telemetry/otel-collector-config.yaml index 76e88a4d20..f8c3fe49c1 100644 --- a/docker/telemetry/otel-collector-config.yaml +++ b/docker/telemetry/otel-collector-config.yaml @@ -2,11 +2,27 @@ # # Pipelines: # traces: OTLP receiver -> batch processor -> debug + Tempo + spanmetrics -# metrics: spanmetrics connector -> Prometheus exporter +# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter +# logs: filelog receiver -> batch processor -> otlphttp/Loki (Phase 8) # # xrpld sends traces via OTLP/HTTP to port 4318. The collector batches # them, forwards to Tempo, and derives RED metrics via the spanmetrics # connector, which Prometheus scrapes on port 8889. +# +# xrpld sends beast::insight metrics natively via OTLP/HTTP to port 4318 +# (same endpoint as traces). The OTLP receiver feeds both the traces and +# metrics pipelines. Metrics are exported to Prometheus alongside +# span-derived metrics. +# +# Phase 8: The filelog receiver tails xrpld's debug.log files under +# /var/log/rippled/ (mounted from the host). A regex_parser operator +# extracts timestamp, partition, severity, and optional trace_id/span_id +# fields injected by Logs::format(). Parsed logs are exported to Grafana +# Loki for log-trace correlation. + +extensions: + health_check: + endpoint: 0.0.0.0:13133 receivers: otlp: @@ -15,11 +31,28 @@ receivers: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 + # Phase 8: Filelog receiver tails xrpld debug.log files for log-trace + # correlation. 
Extracts structured fields (timestamp, partition, severity, + # trace_id, span_id, message) via regex. The trace_id and span_id are + # optional — only present when the log was emitted within an active span. + filelog: + include: [/var/log/rippled/*/debug.log] + operators: + - type: regex_parser + regex: '^(?P<timestamp>\S+\s+\S+)\s+\S+\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$' + timestamp: + parse_from: attributes.timestamp + layout: "%Y-%b-%d %H:%M:%S" processors: batch: timeout: 1s send_batch_size: 100 + resource/logs: + attributes: + - key: service.name + value: xrpld + action: upsert connectors: spanmetrics: @@ -45,15 +78,28 @@ exporters: endpoint: tempo:4317 tls: insecure: true + # Phase 8: Export logs to Grafana Loki via OTLP/HTTP. Loki 3.x supports + # native OTLP ingestion on its /otlp endpoint, replacing the removed + # loki exporter (dropped in otel-collector-contrib v0.147.0). + otlphttp/loki: + endpoint: http://loki:3100/otlp prometheus: endpoint: 0.0.0.0:8889 service: + extensions: [health_check] pipelines: traces: receivers: [otlp] processors: [batch] exporters: [debug, otlp/tempo, spanmetrics] metrics: - receivers: [spanmetrics] + receivers: [otlp, spanmetrics] + processors: [batch] exporters: [prometheus] + # Phase 8: Log pipeline ingests xrpld debug.log via filelog receiver, + # batches entries, and exports to Loki for log-trace correlation. + logs: + receivers: [filelog] + processors: [resource/logs, batch] + exporters: [otlphttp/loki] diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index 27a9fee20f..e4a486ca00 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -294,7 +294,7 @@ prefix=xrpld The `OTelCollector` implementation exports metrics via OTLP/HTTP to the same OTel Collector that receives traces. No separate StatsD receiver is needed. -> **Fallback**: Set `server=statsd` and `address=127.0.0.1:8125` to use the legacy StatsD UDP path during the transition period. 
+> **Fallback**: Set `server=statsd` and `address=127.0.0.1:8125` to use the legacy StatsD UDP path. This requires re-enabling the `statsd` receiver in `otel-collector-config.yaml` and uncommenting port 8125 in `docker-compose.yml`. ### Metric Reference diff --git a/include/xrpl/basics/Log.h b/include/xrpl/basics/Log.h index 5c63166d93..58cca4f486 100644 --- a/include/xrpl/basics/Log.h +++ b/include/xrpl/basics/Log.h @@ -15,7 +15,6 @@ namespace xrpl { // DEPRECATED use beast::severities::Severity instead -// NOLINTNEXTLINE(cppcoreguidelines-use-enum-class) enum LogSeverity { lsINVALID = -1, // used to indicate an invalid severity lsTRACE = 0, // Very low-level progress information, details inside @@ -208,8 +207,6 @@ public: fromString(std::string const& s); private: - // Need to be named before converting - // NOLINTNEXTLINE(cppcoreguidelines-use-enum-class) enum { // Maximum line length for log messages. // If the message exceeds this length it will be truncated with diff --git a/include/xrpl/basics/Mutex.hpp b/include/xrpl/basics/Mutex.hpp index 4432e27b4b..5855ee2017 100644 --- a/include/xrpl/basics/Mutex.hpp +++ b/include/xrpl/basics/Mutex.hpp @@ -131,7 +131,7 @@ public: * @tparam LockType The type of lock to use * @return A lock on the mutex and a reference to the protected data */ - template