Fix Telemetry Validation Checks
Context
The CI pipeline infrastructure is fully operational (build + deploy + run). However,
the validate_telemetry.py validation suite fails 35 checks due to mismatches between
what the validation expects and what the telemetry stack actually produces. These fall
into 4 categories.
CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191
Category 1: StatsD Metrics — 0 Series (25 failures)
Symptoms:
[FAIL] metric.statsd_gauges.xrpld_LedgerMaster_Validated_Ledger_Age: 0 series
[FAIL] metric.statsd_counters.xrpld_rpc_requests: 0 series
[FAIL] metric.statsd_histograms.xrpld_rpc_time: 0 series
[FAIL] metric.overlay_traffic.xrpld_total_Bytes_In: 0 series
[FAIL] metric.phase9_nodestore.xrpld_nodestore_reads_total: 0 series
... (25 total)
Root Cause: Two issues compounding:
- StatsD receiver is commented out in otel-collector-config.yaml (lines 39-54). The collector config was updated to expect native OTLP metrics from beast::insight (comment: "StatsD UDP port removed — beast::insight now uses native OTLP"), but the validation harness configures xrpld nodes with server=statsd.
- Metric name mismatch: expected_metrics.json expects StatsD-style metric names (e.g., xrpld_LedgerMaster_Validated_Ledger_Age). When using server=otel, beast::insight emits OTLP metrics, which may have different names/structure.
Fix Options (pick one):
- Option A (recommended): Change the node config in run-full-validation.sh from server=statsd to server=otel (line 255), remove the address=127.0.0.1:8125 line, then update expected_metrics.json with the actual OTLP metric names. This aligns with the collector config's OTLP-first design and avoids re-enabling the StatsD receiver.
- Option B: Uncomment the StatsD receiver in otel-collector-config.yaml, add statsd to the metrics pipeline receivers list, and keep the node config as server=statsd. Simpler, but goes against the migration to native OTLP.
Investigation needed for Option A:
- Run xrpld locally with server=otel, query Prometheus, and capture the actual OTLP metric names to update expected_metrics.json.
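The capture step could be scripted roughly as follows. This is a minimal sketch, not part of the existing harness: it assumes Prometheus is reachable at http://localhost:9090 (an assumed address) and that the metrics of interest share the xrpld_ prefix seen in the failing checks. The helper names fetch_metric_names and diff_metric_names are hypothetical. The endpoint used is the standard Prometheus label-values API.

```python
import json
import urllib.request


def fetch_metric_names(prometheus_url="http://localhost:9090"):
    """Return every metric name Prometheus currently has series for.

    The URL default is an assumption; adjust it to the local stack.
    """
    url = f"{prometheus_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload["data"]


def diff_metric_names(expected, actual, prefix="xrpld_"):
    """Compare expected metric names with what was actually scraped.

    Only names starting with `prefix` are considered on the actual side,
    so unrelated collector/Prometheus metrics do not show up as noise.
    """
    actual_set = {name for name in actual if name.startswith(prefix)}
    expected_set = set(expected)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }
```

Calling diff_metric_names(expected_names, fetch_metric_names()) against a node running with server=otel would list the names that need to be replaced in expected_metrics.json ("missing") next to the candidates that could replace them ("unexpected").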
Files to modify:
- docker/telemetry/workload/run-full-validation.sh: change [insight] section
- docker/telemetry/workload/expected_metrics.json: update metric names for OTLP
- docker/telemetry/workload/validate_telemetry.py: may need metric query adjustments
Category 2: Missing Spans — tx.process, tx.receive (2 failures)
Symptoms:
[FAIL] span.tx.process: tx.process: 0 traces (expected > 0)
[FAIL] span.tx.receive: tx.receive: 0 traces (expected > 0)
Root Cause: The span names exist in the code:
- src/xrpld/app/misc/NetworkOPs.cpp:1228: XRPL_TRACE_TX("tx.process")
- src/xrpld/overlay/detail/PeerImp.cpp:1273: XRPL_TRACE_TX("tx.receive")
Likely causes (investigate in order):
- Batch delay: The 2-second batch delay (batch_delay_ms=2000) plus the 30s propagation wait may not be enough if these spans are created late in the workload.
- Code path not triggered: tx.process fires in NetworkOPs::processTransaction(). The tx_submitter submits via the RPC submit command, which calls this path. But if the transactions fail validation before reaching processTransaction(), no span is emitted.
- Span naming mismatch: The validation queries Tempo for the exact operation name tx.process. Verify that Tempo stores the span with this exact name.
Investigation:
- Check the tx_submitter output in CI logs — are transactions actually succeeding?
- Query Tempo API locally for all span names to see what's actually emitted.
Files to modify:
- Possibly docker/telemetry/workload/validate_telemetry.py: adjust timing/queries
- Possibly docker/telemetry/workload/run-full-validation.sh: increase propagation wait
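If the batch-delay hypothesis holds, one alternative to a longer fixed sleep is to poll until the span appears or a deadline passes. The sketch below illustrates that idea only. wait_for is a hypothetical helper, not something the validator currently has, and the timeout and interval defaults are assumptions.

```python
import time


def wait_for(check, timeout_s=60.0, interval_s=2.0,
             clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns a truthy value.

    Returns the truthy result, or None if timeout_s elapses first.
    clock and sleep are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Here check would wrap whatever query the validator already issues for tx.process or tx.receive. A slow pipeline then costs only as much wait as it needs, and a span that is never emitted still fails after a bounded time.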
Category 3: Span Hierarchy — rpc.request -> rpc.process (1 failure)
Symptoms:
[FAIL] span.hierarchy.rpc.request->rpc.process: rpc.process not found in rpc.request traces
Root Cause: The validator fetches traces containing rpc.request from Tempo and
checks if any child span is named rpc.process. Both spans are emitted (they pass
individual checks), but the parent-child relationship isn't established.
Investigation:
- Check src/xrpld/rpc/detail/ServerHandler.cpp: rpc.request (line 271) and rpc.process (line 573) are in the same file. Verify that rpc.process is created as a child of rpc.request (i.e., its parent context is set).
- The issue may be that rpc.process creates a new root span instead of linking to the rpc.request span context.
Files to modify:
- Possibly src/xrpld/rpc/detail/ServerHandler.cpp: fix span parenting
- OR docker/telemetry/workload/validate_telemetry.py: if hierarchy check logic is wrong
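To tell a code-side parenting bug from a validator bug, the parent-child check can be run by hand against a single fetched trace. The sketch below assumes the trace JSON uses the OTLP-style layout (batches, then scopeSpans or instrumentationLibrarySpans, then spans carrying name, spanId and parentSpanId). That layout is an assumption about what Tempo returns here and should be confirmed against a real response. The function names are hypothetical and do not reflect the existing validator logic.

```python
def iter_spans(trace):
    """Yield every span dict from an OTLP-style trace payload (assumed shape)."""
    for batch in trace.get("batches", []):
        # Older payloads may use instrumentationLibrarySpans instead of scopeSpans.
        scopes = batch.get("scopeSpans") or batch.get("instrumentationLibrarySpans") or []
        for scope in scopes:
            yield from scope.get("spans", [])


def has_parent_child(trace, parent_name, child_name):
    """True if some span named child_name has a direct parent named parent_name."""
    spans = list(iter_spans(trace))
    by_id = {s["spanId"]: s for s in spans if "spanId" in s}
    for span in spans:
        if span.get("name") != child_name:
            continue
        parent = by_id.get(span.get("parentSpanId"))
        if parent is not None and parent.get("name") == parent_name:
            return True
    return False


def root_span_names(trace):
    """Sorted names of spans that have no parent (missing or empty parentSpanId)."""
    return sorted(s.get("name", "") for s in iter_spans(trace)
                  if not s.get("parentSpanId"))
```

If rpc.process shows up in root_span_names for an rpc.request trace, or only ever appears in traces of its own, that points at ServerHandler.cpp rather than the validator. Note that this sketch checks direct parents only, so an intermediate span between the two would need a walk up the ancestor chain instead.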
Category 4: Dashboard 404s (5 failures)
Symptoms:
[FAIL] dashboard.xrpld-statsd-node-health: HTTP 404
[FAIL] dashboard.xrpld-statsd-network: HTTP 404
[FAIL] dashboard.xrpld-statsd-rpc: HTTP 404
[FAIL] dashboard.xrpld-statsd-overlay-detail: HTTP 404
[FAIL] dashboard.xrpld-statsd-ledger-sync: HTTP 404
Root Cause: Dashboard UIDs were renamed from xrpld-statsd-* to xrpld-system-*
but expected_metrics.json still references the old names.
Actual UIDs in docker/telemetry/grafana/dashboards/:
| Expected (in expected_metrics.json) | Actual (in dashboard JSON) |
|---|---|
| xrpld-statsd-node-health | xrpld-system-node-health |
| xrpld-statsd-network | xrpld-system-network |
| xrpld-statsd-rpc | xrpld-system-rpc |
| xrpld-statsd-overlay-detail | xrpld-system-overlay-detail |
| xrpld-statsd-ledger-sync | xrpld-system-ledger-sync |
Fix: Update the 5 UIDs in expected_metrics.json → grafana_dashboards.uids[].
Files to modify:
- docker/telemetry/workload/expected_metrics.json: update dashboard UIDs
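The rename is small enough to do by hand, but a scripted version makes the change reproducible. This is a minimal sketch that assumes grafana_dashboards.uids is a flat list of UID strings, which should be checked against the real file. rename_dashboard_uids is a hypothetical helper.

```python
import json


def rename_dashboard_uids(expected, old_prefix="xrpld-statsd-",
                          new_prefix="xrpld-system-"):
    """Return a copy of the expected-metrics dict with dashboard UIDs renamed.

    Assumes expected["grafana_dashboards"]["uids"] is a list of strings.
    UIDs that do not start with old_prefix are left unchanged.
    """
    updated = json.loads(json.dumps(expected))  # deep copy via JSON round trip
    uids = updated["grafana_dashboards"]["uids"]
    updated["grafana_dashboards"]["uids"] = [
        new_prefix + uid[len(old_prefix):] if uid.startswith(old_prefix) else uid
        for uid in uids
    ]
    return updated
```

Loading expected_metrics.json with json.load, passing it through this function, and writing the result back with json.dump would apply all five renames from the table in one step.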
Execution Order
- Category 4 (Dashboard UIDs) — trivial rename, no investigation needed
- Category 1 (StatsD/OTLP metrics) — requires investigation to choose Option A vs B and capture actual metric names
- Category 2 (Missing tx spans) — requires investigation into transaction code paths
- Category 3 (Span hierarchy) — requires investigation into span context propagation
Branch
All changes go on: pratik/otel-phase10-workload-validation
Worktree: /tmp/otel-phase10-iter