Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill

2026-06-03 08:46:46 +00:00 · 2026-04-29 21:09:47 +01:00
parent 9e12e660fe 8e7a2d6c53
commit 1658d3dc40
11 changed files with 70 additions and 3247 deletions
--- a/docs/telemetry-runbook.md
+++ b/docs/telemetry-runbook.md
@@ -277,9 +277,9 @@ Configured in `otel-collector-config.yaml`:
 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
 ```

-## StatsD Metrics (beast::insight)
+## System Metrics (OTel native -- beast::insight)

-xrpld has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.
+xrpld has a built-in metrics framework (`beast::insight`) that exports metrics natively via OTLP to the OTel Collector. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.

 ### Configuration

@@ -287,12 +287,14 @@ Add to `xrpld.cfg`:

 ```ini
 [insight]
-server=statsd
-address=127.0.0.1:8125
+server=otel
+endpoint=http://localhost:4318/v1/metrics
 prefix=xrpld
 ```

-The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics.
+The `OTelCollector` implementation exports metrics via OTLP/HTTP to the same OTel Collector that receives traces. No separate StatsD receiver is needed.
+
+> **Fallback**: Set `server=statsd` and `address=127.0.0.1:8125` to use the legacy StatsD UDP path during the transition period. In that mode the Collector needs a `statsd` receiver listening on UDP port 8125.

 ### Metric Reference

@@ -347,7 +349,7 @@ These gauges are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s i
 | `xrpld_warn`                    | Logic.h:33            | Resource manager warning count |
 | `xrpld_drop`                    | Logic.h:34            | Resource manager drop count    |

-#### Histograms (from StatsD timers)
+#### Histograms

 | Prometheus Metric     | Source                | Description                    |
 | --------------------- | --------------------- | ------------------------------ |
@@ -426,7 +428,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
 | Proposals Trusted vs Untrusted   | piechart   | by `xrpl_peer_proposal_trusted`   | `xrpl_peer_proposal_trusted`   |
 | Validations Trusted vs Untrusted | piechart   | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |

-### Node Health -- StatsD (`xrpld-statsd-node-health`)
+### Node Health -- System Metrics (`xrpld-system-node-health`)

 | Panel                                  | Type       | PromQL                                                          | Labels Used      |
 | -------------------------------------- | ---------- | --------------------------------------------------------------- | ---------------- |
@@ -455,7 +457,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
 | Database Sizes                         | timeseries | `xrpld_db_metrics{metric=~"db_kb_.*"}`                          | `metric`         |
 | Historical Fetch Rate                  | stat       | `xrpld_db_metrics{metric="historical_perminute"}`               | `metric`         |

-### Network Traffic -- StatsD (`xrpld-statsd-network`)
+### Network Traffic -- System Metrics (`xrpld-system-network`)

 | Panel                                | Type       | PromQL                                     | Labels Used |
 | ------------------------------------ | ---------- | ------------------------------------------ | ----------- |
@@ -470,7 +472,7 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
 | Duplicate Traffic (Wasted Bandwidth) | timeseries | `rate(xrpld_*_duplicate_Bytes_In/Out[5m])` | —           |
 | All Traffic Categories (Detail)      | timeseries | `topk(15, rate(xrpld_*_Bytes_In[5m]))`     | —           |

-### RPC & Pathfinding -- StatsD (`xrpld-statsd-rpc`)
+### RPC & Pathfinding -- System Metrics (`xrpld-system-rpc`)

 | Panel                     | Type       | PromQL                                                 | Labels Used |
 | ------------------------- | ---------- | ------------------------------------------------------ | ----------- |
@@ -574,6 +576,14 @@ count_over_time({job="xrpld"} |= "trace_id=" [5m])
 5. Verify Tempo is receiving data: open Grafana → Explore → select Tempo datasource → search by `service.name = xrpld`
 6. Check Tempo logs: `docker compose -f docker/telemetry/docker-compose.yml logs tempo`

+### No system metrics in Prometheus
+
+1. Check the xrpld logs for the `OTelCollector starting` message
+2. Verify `server=otel` in the `[insight]` config section
+3. Verify the endpoint in `[insight]` points to the OTLP/HTTP port (default: `http://localhost:4318/v1/metrics`)
+4. Check that the `otlp` receiver is listed among the metrics pipeline's receivers in `otel-collector-config.yaml`
+5. Query Prometheus directly: `curl 'http://localhost:9090/api/v1/query?query=xrpld_job_count'`
+
 ### Server info gauge shows server_state=0

 This is normal during startup. The server starts in DISCONNECTED mode (0) and