diff --git a/.clang-tidy b/.clang-tidy index 6a7005b464..6a967532db 100644 --- a/.clang-tidy +++ b/.clang-tidy @@ -7,7 +7,6 @@ Checks: "-*, bugprone-bad-signal-to-kill-thread, bugprone-bool-pointer-implicit-conversion, bugprone-casting-through-void, - bugprone-capturing-this-in-member-variable, bugprone-chained-comparison, bugprone-compare-pointer-to-member-virtual-function, bugprone-copy-constructor-init, @@ -29,7 +28,6 @@ Checks: "-*, bugprone-misplaced-operator-in-strlen-in-alloc, bugprone-misplaced-pointer-arithmetic-in-alloc, bugprone-misplaced-widening-cast, - bugprone-misleading-setter-of-reference, bugprone-move-forwarding-reference, bugprone-multi-level-implicit-pointer-conversion, bugprone-multiple-new-in-one-expression, @@ -87,7 +85,6 @@ Checks: "-*, cppcoreguidelines-pro-type-static-cast-downcast, cppcoreguidelines-rvalue-reference-param-not-moved, cppcoreguidelines-use-default-member-init, - cppcoreguidelines-use-enum-class, cppcoreguidelines-virtual-class-destructor, hicpp-ignored-remove-result, misc-const-correctness, @@ -112,7 +109,6 @@ Checks: "-*, modernize-use-nodiscard, modernize-use-override, modernize-use-ranges, - modernize-use-scoped-lock, modernize-use-starts-ends-with, modernize-use-std-numbers, modernize-use-using, @@ -126,7 +122,6 @@ Checks: "-*, performance-move-constructor-init, performance-no-automatic-move, performance-trivially-destructible, - readability-ambiguous-smartptr-reset-call, readability-avoid-nested-conditional-operator, readability-avoid-return-with-void-value, readability-braces-around-statements, diff --git a/.github/scripts/levelization/generate.py b/.github/scripts/levelization/generate.py old mode 100644 new mode 100755 diff --git a/.github/scripts/levelization/results/ordering.txt b/.github/scripts/levelization/results/ordering.txt index b829510b60..778b99f486 100644 --- a/.github/scripts/levelization/results/ordering.txt +++ b/.github/scripts/levelization/results/ordering.txt @@ -193,7 +193,6 @@ test.toplevel > 
xrpl.json test.unit_test > xrpl.basics test.unit_test > xrpl.protocol tests.libxrpl > xrpl.basics -tests.libxrpl > xrpl.core tests.libxrpl > xrpld.telemetry tests.libxrpl > xrpl.json tests.libxrpl > xrpl.net diff --git a/.github/scripts/rename/README.md b/.github/scripts/rename/README.md index ab685bb0c3..123881094e 100644 --- a/.github/scripts/rename/README.md +++ b/.github/scripts/rename/README.md @@ -1,11 +1,11 @@ ## Renaming ripple(d) to xrpl(d) In the initial phases of development of the XRPL, the open source codebase was -called "rippled" and it remains with that name even today. Today, over 1000 +called "xrpld" and it remains with that name even today. Today, over 1000 nodes run the application, and code contributions have been submitted by developers located around the world. The XRPL community is larger than ever. In light of the decentralized and diversified nature of XRPL, we will rename any -references to `ripple` and `rippled` to `xrpl` and `xrpld`, when appropriate. +references to `ripple` and `xrpld` to `xrpl` and `xrpld`, when appropriate. See [here](https://xls.xrpl.org/xls/XLS-0095-rename-rippled-to-xrpld.html) for more information. @@ -22,17 +22,17 @@ run from the repository root. 2. `.github/scripts/rename/copyright.sh`: This script will remove superfluous copyright notices. 3. `.github/scripts/rename/cmake.sh`: This script will rename all CMake files - from `RippleXXX.cmake` or `RippledXXX.cmake` to `XrplXXX.cmake`, and any - references to `ripple` and `rippled` (with or without capital letters) to + from `RippleXXX.cmake` or `XrpldXXX.cmake` to `XrplXXX.cmake`, and any + references to `ripple` and `xrpld` (with or without capital letters) to `xrpl` and `xrpld`, respectively. The name of the binary will remain as-is, and will only be renamed to `xrpld` by a later script. 4. 
`.github/scripts/rename/binary.sh`: This script will rename the binary from - `rippled` to `xrpld`, and reverses the symlink so that `rippled` points to + `xrpld` to `xrpld`, and reverses the symlink so that `xrpld` points to the `xrpld` binary. 5. `.github/scripts/rename/namespace.sh`: This script will rename the C++ namespaces from `ripple` to `xrpl`. 6. `.github/scripts/rename/config.sh`: This script will rename the config from - `rippled.cfg` to `xrpld.cfg`, and updating the code accordingly. The old + `xrpld.cfg` to `xrpld.cfg`, and updating the code accordingly. The old filename will still be accepted. 7. `.github/scripts/rename/docs.sh`: This script will rename any lingering references of `ripple(d)` to `xrpl(d)` in code, comments, and documentation. diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 00a71a25c6..783b39f767 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -576,7 +576,187 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. --- -## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) +## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12) + +**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline. + +### Motivation: Why Migrate from StatsD to Native OTel Metrics + +The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves. + +#### What We Gain + +1. **Unified telemetry pipeline** — Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics." + +2. 
**Eliminates StatsD UDP limitations** — StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control. + +3. **Fixes the `|m` wire format issue** — The `StatsDMeterImpl` uses non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 — DEFERRED becomes resolved). + +4. **Richer metric semantics** — OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these. + +5. **Removes infrastructure dependency** — No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML. + +6. **Metric-to-trace correlation** — OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it — impossible with StatsD-sourced metrics. + +7. **Production-grade export** — OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown — all built into the SDK rather than hand-rolled in `StatsDCollectorImp`. + +#### What We Lose + +1. **StatsD ecosystem compatibility** — Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraph) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback. + +2. **Simplicity of UDP** — StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts. + +3. 
**Slightly higher memory** — OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state. + +4. **Dependency on OTel C++ Metrics SDK stability** — The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem. + +#### Decision + +The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period. + +### Architecture + +#### Class Hierarchy (after Phase 7) + +``` +beast::insight::Collector (abstract interface — unchanged) + | + +-- StatsDCollector (existing — retained as fallback, deprecated) + | +-- StatsDCounterImpl -> StatsD |c over UDP + | +-- StatsDGaugeImpl -> StatsD |g over UDP + | +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard) + | +-- StatsDEventImpl -> StatsD |ms over UDP + | +-- StatsDHookImpl -> 1s periodic callback + | + +-- NullCollector (existing — unchanged, used when disabled) + | +-- NullCounterImpl -> no-op + | +-- NullGaugeImpl -> no-op + | +-- NullMeterImpl -> no-op + | +-- NullEventImpl -> no-op + | +-- NullHookImpl -> no-op + | + +-- OTelCollector (NEW — Phase 7) + +-- OTelCounterImpl -> otel::Counter + +-- OTelGaugeImpl -> otel::ObservableGauge + +-- OTelMeterImpl -> otel::Counter + +-- OTelEventImpl -> otel::Histogram + +-- OTelHookImpl -> 1s periodic callback (same pattern) +``` + +#### Data Flow (after Phase 7) + +```mermaid +graph LR + subgraph xrpldNode["xrpld Node"] + A["Trace Macros<br/>XRPL_TRACE_SPAN"] + B["beast::insight<br/>OTelCollector"] + end + + subgraph collector["OTel Collector :4317 / :4318"] + direction TB + R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"] + BP["Batch Processor"] + SM["SpanMetrics Connector"] + + R1 --> BP + BP --> SM + end + + subgraph backends["Trace Backends"] + D["Jaeger / Tempo"] + end + + subgraph metrics["Metrics Stack"] + E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"] + end + + subgraph viz["Visualization"] + F["Grafana :3000"] + end + + A -->|"OTLP/HTTP :4318<br/>(traces)"| R1 + B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1 + + BP -->|"OTLP/gRPC"| D + SM -->|"RED metrics"| E + R1 -->|"xrpld_* metrics<br/>(native OTLP)"| E + + E --> F + D --> F + + style A fill:#4a90d9,color:#fff,stroke:#2a6db5 + style B fill:#d9534f,color:#fff,stroke:#b52d2d + style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BP fill:#449d44,color:#fff,stroke:#2d6e2d + style SM fill:#449d44,color:#fff,stroke:#2d6e2d + style D fill:#f0ad4e,color:#000,stroke:#c78c2e + style E fill:#f0ad4e,color:#000,stroke:#c78c2e + style F fill:#5bc0de,color:#000,stroke:#3aa8c1 + style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9 + style collector fill:#1a3320,color:#ccc,stroke:#5cb85c + style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e + style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e + style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de +``` + +**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port. + +#### Configuration + +```ini +# [insight] section — new "otel" server option +[insight] +server=otel # NEW: uses OTel OTLP metrics exporter +prefix=xrpld # metric name prefix (preserved) + +# Endpoint and auth inherited from [telemetry] section: +[telemetry] +enabled=1 +endpoint=http://localhost:4318/v1/traces +``` + +The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed. + +**Backward compatibility**: `server=statsd` continues to work exactly as before. + +See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown. 
+ +### Instrument Type Mapping + +| beast::insight | OTel Metrics SDK | Rationale | +| ---------------------- | -------------------------------- | ---------------------------------------------------------------- | +| Counter (int64, `\|c`) | `Counter` | Direct 1:1 mapping | +| Gauge (uint64, `\|g`) | `ObservableGauge` | Async callback matches existing Hook polling pattern | +| Meter (uint64, `\|m`) | `Counter` | Fixes non-standard wire format; meters are semantically counters | +| Event (ms, `\|ms`) | `Histogram` | Duration distributions with explicit bucket boundaries | +| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval | + +### Tasks + +| Task | Description | +| ---- | ------------------------------------------------------------------------- | +| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) | +| 7.2 | Implement `OTelCollector` class (~400-500 lines) | +| 7.3 | Update `CollectorManager` — add `server=otel` | +| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) | +| 7.5 | Preserve metric names in Prometheus (naming strategy) | +| 7.6 | Update Grafana dashboards (if names change) | +| 7.7 | Update integration tests | +| 7.8 | Update documentation (runbook, reference docs) | + +### Exit Criteria + +- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) +- [ ] `server=otel` is the default in development docker-compose +- [ ] `server=statsd` still works as a fallback +- [ ] Existing Grafana dashboards display data correctly +- [ ] Integration test passes with OTLP-only metrics pipeline +- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) +- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant + +--- + +## 6.9 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) ### Motivation @@ -588,7 +768,7 @@ xrpld's `beast::Journal` logs and OpenTelemetry traces are currently two disjoin 2. 
**Reverse lookup (log-to-trace)** — Loki derived fields make `trace_id` values clickable links back to Tempo. 3. **Unified observability** — All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance. 4. **Zero new dependencies in xrpld** — Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1. -5. **Negligible overhead** — `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min. +5. **Negligible overhead** — The implementation checks the thread-local context value directly, avoiding heap allocation on the no-span path (~15-20ns). On the active-span path, total cost is ~50ns per log call. At typical logging rates, overhead is negligible. #### Losses / Risks diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index 696ad39759..96295fc2ed 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -517,7 +517,7 @@ Example: ### Implementation -The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It reads the current span from OTel's thread-local runtime context via `opentelemetry::trace::GetSpan()` and `opentelemetry::context::RuntimeContext::GetCurrent()`. Both calls are lock-free thread-local reads measured at <10ns per call. +The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It checks the thread-local runtime context value directly (via `RuntimeContext::GetCurrent().GetValue(kSpanKey)`) to avoid the heap allocation that `GetSpan()` performs on the no-span path. On threads without an active span, the cost is a thread-local read + variant type check (~15-20ns). On the active-span path, total cost is ~50ns per log call. 
### Log Ingestion Pipeline diff --git a/OpenTelemetryPlan/Phase8_taskList.md b/OpenTelemetryPlan/Phase8_taskList.md index d7c4770584..3f68f2c7ac 100644 --- a/OpenTelemetryPlan/Phase8_taskList.md +++ b/OpenTelemetryPlan/Phase8_taskList.md @@ -24,23 +24,32 @@ **What to do**: - Edit `src/libxrpl/basics/Log.cpp`: - - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span: + - In `Logs::format()` (around line 346), after severity is appended, check for active OTel span. The implementation checks the context value directly to avoid the heap allocation that `GetSpan()` performs on the no-span path: ```cpp #ifdef XRPL_ENABLE_TELEMETRY - auto span = opentelemetry::trace::GetSpan( - opentelemetry::context::RuntimeContext::GetCurrent()); - auto ctx = span->GetContext(); - if (ctx.IsValid()) { - // Append trace context as structured fields - char traceId[33], spanId[17]; - ctx.trace_id().ToLowerBase16(traceId); - ctx.span_id().ToLowerBase16(spanId); - output += "trace_id="; - output.append(traceId, 32); - output += " span_id="; - output.append(spanId, 16); - output += ' '; + auto context = opentelemetry::context::RuntimeContext::GetCurrent(); + auto spanValue = context.GetValue(opentelemetry::trace::kSpanKey); + if (opentelemetry::nostd::holds_alternative< + opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue)) + { + auto span = opentelemetry::nostd::get< + opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>>(spanValue); + auto spanCtx = span->GetContext(); + if (spanCtx.IsValid()) + { + char traceId[32], spanId[16]; + spanCtx.trace_id().ToLowerBase16( + opentelemetry::nostd::span<char, 32>{traceId}); + spanCtx.span_id().ToLowerBase16( + opentelemetry::nostd::span<char, 16>{spanId}); + output += "trace_id="; + output.append(traceId, 32); + output += " span_id="; + output.append(spanId, 16); + output += ' '; + } + } } #endif ``` @@ -53,7 +62,7 @@ - `src/libxrpl/basics/Log.cpp` -**Performance note**: `GetSpan()` and `GetContext()` are thread-local reads with no locking — measured at 
<10ns per call. With ~1000 JLOG calls/min, this adds <10us/min of overhead. +**Performance note**: The implementation checks the thread-local context value directly (avoiding the heap allocation that `GetSpan()` performs on the no-span path). On threads without an active span (~99% of log lines), the cost is a thread-local read + variant type check (~15-20ns). On the active-span path, an additional shared_ptr copy + `GetContext()` + `IsValid()` adds ~50ns total. Overhead is negligible at typical logging rates. --- diff --git a/docker/telemetry/docker-compose.yml b/docker/telemetry/docker-compose.yml index 3588713ee3..4fa3292888 100644 --- a/docker/telemetry/docker-compose.yml +++ b/docker/telemetry/docker-compose.yml @@ -4,7 +4,7 @@ # - otel-collector: receives OTLP traces from xrpld, batches and # forwards them to Tempo. Also tails xrpld log files # via filelog receiver and exports to Loki. Listens on ports -# 4317 (gRPC), 4318 (HTTP), and 8125 (StatsD UDP). +# 4317 (gRPC) and 4318 (HTTP). # - tempo: Grafana Tempo tracing backend, queryable via Grafana Explore # on port 3000. Recommended for production (S3/GCS storage, TraceQL). # - loki: Grafana Loki log aggregation backend for centralized log @@ -30,7 +30,6 @@ services: - "4317:4317" # OTLP gRPC - "4318:4318" # OTLP HTTP (traces + native OTel metrics) - "8889:8889" # Prometheus metrics (spanmetrics + OTLP) - - "13133:13133" # Health check # StatsD UDP port removed — beast::insight now uses native OTLP. 
# Uncomment if using server=statsd fallback: # - "8125:8125/udp" @@ -74,7 +73,7 @@ services: volumes: - loki-data:/loki networks: - - rippled-telemetry + - xrpld-telemetry prometheus: image: prom/prometheus:latest diff --git a/docker/telemetry/integration-test.sh b/docker/telemetry/integration-test.sh index 4dddcbfa03..d6656d088c 100755 --- a/docker/telemetry/integration-test.sh +++ b/docker/telemetry/integration-test.sh @@ -64,7 +64,7 @@ check_span() { fi } -# Phase 8: Verify trace_id injection in rippled log output. +# Phase 8: Verify trace_id injection in xrpld log output. # Greps all node debug.log files for the "trace_id= span_id=" # pattern that Logs::format() injects when an active OTel span exists. # Also cross-checks that a trace_id found in logs matches a trace in Tempo. @@ -72,6 +72,7 @@ check_log_correlation() { log "Checking log-trace correlation..." local total_matches=0 + local files_scanned=0 local sample_trace_id="" for i in $(seq 1 "$NUM_NODES"); do @@ -79,8 +80,9 @@ check_log_correlation() { if [ ! 
-f "$logfile" ]; then continue fi + files_scanned=$((files_scanned + 1)) local matches - matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile" 2>/dev/null || echo 0) + matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile") || matches=0 total_matches=$((total_matches + matches)) # Capture the first trace_id we find for cross-referencing with Tempo if [ -z "$sample_trace_id" ] && [ "$matches" -gt 0 ]; then @@ -88,17 +90,22 @@ check_log_correlation() { fi done + if [ "$files_scanned" -eq 0 ]; then + fail "Log correlation: no debug.log files found in $WORKDIR/node*/" + return + fi + if [ "$total_matches" -gt 0 ]; then - ok "Log correlation: found $total_matches log lines with trace_id" + ok "Log correlation: found $total_matches log lines with trace_id ($files_scanned nodes scanned)" else - fail "Log correlation: no trace_id found in any node debug.log" + fail "Log correlation: no trace_id found in any node debug.log ($files_scanned nodes scanned)" fi # Cross-check: verify the sample trace_id exists in Tempo if [ -n "$sample_trace_id" ]; then local trace_found trace_found=$(curl -sf "$TEMPO/api/traces/$sample_trace_id" \ - | jq '.batches | length' 2>/dev/null || echo 0) + | jq '.batches | length' 2>/dev/null) || trace_found=0 if [ "$trace_found" -gt 0 ]; then ok "Log-Tempo cross-check: trace_id=$sample_trace_id found in Tempo" else diff --git a/docker/telemetry/otel-collector-config.yaml b/docker/telemetry/otel-collector-config.yaml index 1940ca61e1..65707d4a28 100644 --- a/docker/telemetry/otel-collector-config.yaml +++ b/docker/telemetry/otel-collector-config.yaml @@ -2,23 +2,20 @@ # # Pipelines: # traces: OTLP receiver -> batch processor -> debug + Tempo + spanmetrics -# metrics: OTLP + StatsD receivers + spanmetrics connector -> Prometheus exporter +# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter # logs: filelog receiver -> batch processor -> otlphttp/Loki (Phase 8) # # xrpld sends traces via 
OTLP/HTTP to port 4318. The collector batches -# them, forwards to Tempo, and derives RED metrics via the -# spanmetrics connector, which Prometheus scrapes on port 8889. +# them, forwards to Tempo, and derives RED metrics via the spanmetrics +# connector, which Prometheus scrapes on port 8889. # # xrpld sends beast::insight metrics natively via OTLP/HTTP to port 4318 # (same endpoint as traces). The OTLP receiver feeds both the traces and # metrics pipelines. Metrics are exported to Prometheus alongside # span-derived metrics. # -# The StatsD receiver accepts beast::insight metrics from xrpld nodes -# configured with server=statsd in [insight]. It listens on UDP port 8125. -# # Phase 8: The filelog receiver tails xrpld's debug.log files under -# /var/log/xrpld/ (mounted from the host). A regex_parser operator +# /var/log/rippled/ (mounted from the host). A regex_parser operator # extracts timestamp, partition, severity, and optional trace_id/span_id # fields injected by Logs::format(). Parsed logs are exported to Grafana # Loki for log-trace correlation. @@ -34,39 +31,28 @@ receivers: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 - # StatsD receiver — accepts beast::insight metrics from xrpld nodes - # configured with server=statsd in [insight]. Listens on UDP port 8125. - statsd: - endpoint: "0.0.0.0:8125" - aggregation_interval: 15s - enable_metric_type: true - is_monotonic_counter: true - timer_histogram_mapping: - - statsd_type: "timing" - observer_type: "summary" - summary: - percentiles: [0, 50, 90, 95, 99, 100] - - statsd_type: "histogram" - observer_type: "summary" - summary: - percentiles: [0, 50, 90, 95, 99, 100] # Phase 8: Filelog receiver tails xrpld debug.log files for log-trace # correlation. Extracts structured fields (timestamp, partition, severity, # trace_id, span_id, message) via regex. The trace_id and span_id are # optional — only present when the log was emitted within an active span. 
filelog: - include: [/var/log/xrpld/*/debug.log] + include: [/var/log/rippled/*/debug.log] operators: - type: regex_parser - regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$' + regex: '^(?P<timestamp>\S+\s+\S+)\s+\S+\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$' timestamp: parse_from: attributes.timestamp - layout: "%Y-%b-%d %H:%M:%S.%f" + layout: "%Y-%b-%d %H:%M:%S" processors: batch: timeout: 1s send_batch_size: 100 + resource/logs: + attributes: + - key: service.name + value: xrpld + action: upsert connectors: spanmetrics: @@ -110,12 +96,12 @@ service: processors: [batch] exporters: [debug, otlp/tempo, spanmetrics] metrics: - receivers: [otlp, spanmetrics, statsd] + receivers: [otlp, spanmetrics] processors: [batch] exporters: [prometheus] # Phase 8: Log pipeline ingests xrpld debug.log via filelog receiver, # batches entries, and exports to Loki for log-trace correlation. logs: receivers: [filelog] - processors: [batch] + processors: [resource/logs, batch] exporters: [otlphttp/loki] diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index ba7b53c323..2b7496a1d3 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -294,7 +294,7 @@ prefix=xrpld The `OTelCollector` implementation exports metrics via OTLP/HTTP to the same OTel Collector that receives traces. No separate StatsD receiver is needed. -> **Fallback**: Set `server=statsd` and `address=127.0.0.1:8125` to use the legacy StatsD UDP path during the transition period. +> **Fallback**: Set `server=statsd` and `address=127.0.0.1:8125` to use the legacy StatsD UDP path. This requires re-enabling the `statsd` receiver in `otel-collector-config.yaml` and uncommenting port 8125 in `docker-compose.yml`. 
### Metric Reference diff --git a/include/xrpl/basics/Log.h b/include/xrpl/basics/Log.h index 5c63166d93..58cca4f486 100644 --- a/include/xrpl/basics/Log.h +++ b/include/xrpl/basics/Log.h @@ -15,7 +15,6 @@ namespace xrpl { // DEPRECATED use beast::severities::Severity instead -// NOLINTNEXTLINE(cppcoreguidelines-use-enum-class) enum LogSeverity { lsINVALID = -1, // used to indicate an invalid severity lsTRACE = 0, // Very low-level progress information, details inside @@ -208,8 +207,6 @@ public: fromString(std::string const& s); private: - // Need to be named before converting - // NOLINTNEXTLINE(cppcoreguidelines-use-enum-class) enum { // Maximum line length for log messages. // If the message exceeds this length it will be truncated with diff --git a/include/xrpl/basics/Mutex.hpp b/include/xrpl/basics/Mutex.hpp index 4432e27b4b..5855ee2017 100644 --- a/include/xrpl/basics/Mutex.hpp +++ b/include/xrpl/basics/Mutex.hpp @@ -131,7 +131,7 @@ public: * @tparam LockType The type of lock to use * @return A lock on the mutex and a reference to the protected data */ - template